LICENSE. ABACUS CONCEPTS hereby grants to you a nonexclusive license to use the enclosed computer Program 
subject to the terms of this Agreement. The Program and any backup copies may be used only on the single 
computing machine owned by you and for your own purposes. 


COPYRIGHT. This Program and accompanying manual are copyrighted and contain proprietary information. All rights 
reserved. This Program and manual may not, in whole or in part, be copied, photocopied, reproduced, translated or 
Manual Copyright: 
Chapters 3, 4, 5 © Abacus Concepts, Inc. 
chase of this License. This warranty does not cover defects due to accident, abuse, service or modification by any 
unauthorized person, or any cause occurring after initial delivery of the medium to Licensee. THIS WARRANTY GIVES 
YOU SPECIFIC LEGAL RIGHTS, AND YOU MAY ALSO HAVE OTHER RIGHTS WHICH VARY FROM STATE TO STATE. 


LIMITATION OF IMPLIED WARRANTIES. ALL IMPLIED WARRANTIES WITH RESPECT TO THE RECORDING 
MEDIUM, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 
PURPOSE, ARE LIMITED IN DURATION TO NINETY (90) DAYS FROM THE DATE OF RETAIL PURCHASE OF THIS 
LICENSE. SOME STATES DO NOT ALLOW LIMITATIONS ON HOW LONG AN IMPLIED WARRANTY LASTS, SO THE 
ABOVE LIMITATION MAY NOT APPLY TO YOU. 


DISCLAIMER OF WARRANTY FOR SOFTWARE. Even though Abacus Concepts has tested and reviewed the software 
and documentation, ABACUS CONCEPTS' SOFTWARE IS LICENSED ON AN "AS IS" BASIS. THIS MEANS THAT THE 
ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS ASSUMED BY LICENSEE. EXCEPT AS 
MAY BE PROVIDED OTHERWISE IN THIS AGREEMENT, SHOULD THE SOFTWARE PROVE DEFECTIVE FOLLOWING 
ITS PURCHASE, LICENSEE, AND NOT ABACUS CONCEPTS OR ITS AUTHORIZED DISTRIBUTORS OR AGENTS, 
ASSUMES THE ENTIRE COST OF ALL NECESSARY SERVICE, REPAIR, OR CORRECTION. ABACUS CONCEPTS 
DISCLAIMS ALL IMPLIED WARRANTIES FOR THE SOFTWARE, INCLUDING WARRANTIES OF MERCHANTABILITY 
AND FITNESS FOR A PARTICULAR PURPOSE. ABACUS CONCEPTS MAKES NO REPRESENTATIONS CONCERNING 
THE QUALITY OF THE SOFTWARE AND DOES NOT PROMISE THAT THE SOFTWARE WILL BE ERROR FREE OR 
WILL OPERATE WITHOUT INTERRUPTION. SPECIFICATIONS OF THE SOFTWARE INCLUDING THE AMOUNT OF 
MEMORY, OR TIME REQUIRED FOR EXECUTION OF ANY PROGRAM MAY BE CHANGED IN NEW RELEASES AND 
VERSIONS. 


LIMITATION OF LIABILITY. IN NO EVENT WILL ABACUS CONCEPTS BE LIABLE FOR ANY DIRECT, INDIRECT, 
INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE RECORDING 
MEDIUM OR THE SOFTWARE BY ANY PERSON, WHETHER OR NOT INFORMED OF THE POSSIBILITY OF 
DAMAGES IN ADVANCE. ABACUS CONCEPTS’ TOTAL LIABILITY WITH RESPECT TO ALL CAUSES OF ACTION 
TOGETHER WILL NOT EXCEED AMOUNTS PAID BY LICENSEE TO ABACUS CONCEPTS FOR THIS LICENSE. THESE 
LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING BREACH OF CONTRACT, BREACH OF WARRANTY, 
ABACUS CONCEPTS’ NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS. SOME STATES 
DO NOT ALLOW THE EXCLUSION OR LIMITATION OF INCIDENTAL OR CONSEQUENTIAL DAMAGES, SO THE 
ABOVE LIMITATION OR EXCLUSION MAY NOT APPLY TO YOU. 


TERM AND TERMINATION. This agreement shall continue in force until terminated. Failure of the customer to abide 
by the terms of this license agreement, in particular the prohibition against unauthorized reproduction of the program 
and/or manual will terminate this agreement and result in withdrawal of technical support, and forfeiture of all rights 
under this Customer User Agreement. 


GOVERNING LAW. This Agreement shall be governed by the laws of the State of California. 
YOU HAVEN'T HEARD THE LAST OF IT YET... 
The following information should be added to the manual: 
Rotated text 


Rotated text will appear on color printouts only in the color of that particular text's first 
letter. 


Multiple lines of rotated text will not print correctly on a LaserWriter. 


Changing the font of text in the table view window 
You can now change the font of a table view window. This feature allows you to 
choose a LaserWriter font if you are either printing table views to a LaserWriter or 
copying entire views that will later be pasted into documents printed on LaserWriters. 


Maintaining the aspect ratio of pasted pictures 


In order to maintain the aspect ratio of pictures pasted into a Graphic view window, first 
select the object, then hold down the shift key and shrink or enlarge the object using its 
grow box. 


Double-clicking on a pasted picture whose size has been changed returns 
the picture to its original size. 


Using your last drawing tool 


Holding down the command key and then clicking in a graphic view window changes 
the arrow cursor to the last used drawing tool. 


No more "read-only" files 


In order to ensure compatibility with networks, you can no longer open a "Read-Only" 
version of a file. 


Normal axes 
The normal axes option is not implemented in the "Open Axis" dialog box. 
Customizing scattergram views in multiple regressions and factor plots 


Customizations to scattergram views of multiple regressions and factor analysis factor 
plots will disappear when you turn to a new page. 


Running under Multifinder 


StatView II can operate with a minimum application memory size of 512k. For large 
analyses we suggest using a 768k or larger memory size. 
The Manual's Structure 


Introduction 


StatView II is a statistical data analysis and presentation graphics 
program designed to take full advantage of the Macintosh II's 
numerical co-processor and its 16 million colors. 


With StatView II, analyzing data is simple: enter and select the data, 
select the statistical test to perform, and view the results of the 
analysis. You can accomplish this without having to learn a difficult 
command language or enter command scripts. Because data analysis 
is an iterative process, StatView II allows you to quickly and easily 
explore several approaches to your data analysis. You can change 
data values, eliminate outliers, examine subsets, choose new 
variables, select new statistics, and the program will automatically 
recompute your results and redraw your graphs. This process can be 
accomplished in seconds, because StatView II performs its tasks at 
speeds equal to or greater than mainframe computers. 


Since data analysis is more than producing values, StatView II 
provides a presentation graphics package. You can create a variety of 
graphic views of your data and customize them to suit your needs. 
StatView II contains all the tools you need to add emphasis to your 
charts. These tools include text, arrows and shapes, choice of fill 
patterns, text fonts, style and size, and more. Also, StatView II is 
designed to use the more than 16 million colors available on the 
Macintosh II. 


With StatView II, one program allows you to manage and analyze 
your data, and then produce presentation quality output which 
clearly and concisely expresses your results and conclusions. 


Chapter 1, Quick-Start, is a start-up tutorial that offers important 
information on using StatView II and then runs through sample 
analyses and graphic customization. After reading this brief chapter, 
you will be familiar with several important StatView II concepts. 


Chapter 2, Using StatView II, is an extended tutorial that is meant to 
be read at the computer. This chapter demonstrates the operating 
capabilities of StatView II in detail. 


Chapter 3, Drawing with StatView II, provides detailed information 
on how to create and modify graphs. This chapter shows you how to use 
StatView II's powerful presentation graphics. 


Chapter 4, A Detailed Look at the Describe Menu, provides detailed 
information about each of the choices in StatView II's Describe 
menu. It also includes sample problems for each choice. 


Chapter 5, A Detailed Look at the Compare Menu, provides 
information about each of the choices in StatView II's Compare 
menu. It also includes sample problems for each choice. 


Chapter 6, Specialized Data Handling, discusses methods of 
massaging and manipulating data. The chapter covers transforming 
variables, creating new variables by formula, recoding variables, 
sorting data, creating series of values, and splitting columns. 


Appendix A, StatView II Memory Limits, provides information on 
the size limits of StatView II datasets and error messages related to 
memory. 


Appendix B, Formulae and References, lists all the formulae used in 
StatView II as well as the books and journal articles from which they 
came. 


This manual assumes that you are familiar with standard Macintosh 
terminology and operation. If you are unfamiliar with the Macintosh 
terms and actions used in this manual, consult your Macintosh 
owner's guide. 


StatView II contains several hardware-dependent features. The 
following StatView II terms are used in this manual: 


Color systems 


All Macintosh IIs (and future machines) with monitors set to 16 or 
more colors or shades of gray. 


Non-color systems 


Black-and-white Macintoshes (SE and Plus) as well as Macintosh IIs 
with monitors set to fewer than 16 colors or grays. 


Old Quickdraw 


The release of Quickdraw that is available on the Macintosh Plus and 
the Macintosh SE. It supports only 8 colors: black, red, green, 
yellow, blue, magenta, cyan, and white. 


Numeric co-processor 


A dedicated hardware chip which speeds up numerical computations. 
The Macintosh uses either the model 68881 or 68882 (available first 
quarter 1988). 


Small screen system 


A system with a screen width less than 640 pixels. This is the size of 
a Macintosh Plus or SE screen. 


Large screen system 


A system with screen width greater than or equal to 640 pixels. This 
is the size of the standard Macintosh II monitor. 


What You Need To Get Started 


The version of StatView II shipped with this manual requires a 
Macintosh with the following components: 

• a 68020 (or later) central processor 

• a 68881 (or later) numeric co-processor 

• a minimum of 1 megabyte of main memory (or RAM) 

• Macintosh Operating System version 4.1 or later 

• a hard disk (strongly preferred) or two 800K floppy drives 

The following Macintosh systems satisfy these requirements: 

• the Macintosh II 

• the Macintosh SE with a third-party 68020/68881 accelerator 
board 

• the Macintosh Plus with a third-party 68020/68881 
accelerator board 


If any of these hardware and software requirements are not met, 
StatView II will not operate. Instead, it will notify you of the 
problem when you attempt to start the application. 


If you are using an accelerator board which uses the Macintosh's 
68000 central processor or its own 68000 (as opposed to the 68020 
central processor), Abacus Concepts offers a special, full-featured 
version of StatView II that will operate on these 68000/68881 
Macintoshes. You can obtain this version free of charge by writing 
or calling Abacus Concepts. 
Your Hardware 
Recommendations 


We recommend that your Macintosh have a hard disk and that you 
operate StatView II from that hard disk. If StatView II is operated 
from a floppy drive, many of the performance gains that come from 
the 68020 and 68881 processors are negated by the floppy drive's 
slow speed. 


Please note that StatView II does not require a large screen display. 
However, a larger screen is very useful since it allows you to display 
more data and manipulate larger graphs, thereby increasing your 
efficiency. 


Although StatView II is fully compatible with System 4.1, you will 
benefit by installing System 4.2 and MultiFinder onto your 
Macintosh. StatView II is completely compatible with MultiFinder. 
Since StatView II allows full cut, copy and paste of both data and 
graphics, using it with other applications within MultiFinder will 
increase your productivity. 


Three menu bar titles will appear differently depending on your 
Macintosh screen size: 






[Figure: the three menu bar titles as they appear on a large screen system and on a small screen system] 


The Color choice in the Graph menu will change depending on 
whether the Macintosh you are using is a color or non-color system. 
On a non-color system, the Color choice will contain a list of 8 
colors: black, red, green, yellow, blue, magenta, cyan, and white. On 
a color system, the Color choice will display a palette of 16 colors if 
your monitor is using 16 colors, or 32 colors if your monitor is using 
256 colors. You can customize these palettes to include any of the 
Macintosh II's 16 million colors. 
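
The rule above can be summarized in a few lines of Python. This is only an illustrative sketch and not part of StatView II: the function name `color_choice_size` is hypothetical, and monitor settings the manual does not mention are reported as unknown rather than guessed.

```python
# The eight Old Quickdraw colors listed in the manual.
OLD_QUICKDRAW_COLORS = [
    "black", "red", "green", "yellow", "blue", "magenta", "cyan", "white",
]


def color_choice_size(monitor_colors):
    """Return how many colors the Color choice offers, per the manual's rule.

    monitor_colors is the number of colors (or grays) the monitor is set to.
    Returns None for settings the manual does not describe.
    """
    if monitor_colors < 16:
        # Non-color system: the fixed list of eight named colors.
        return len(OLD_QUICKDRAW_COLORS)
    if monitor_colors == 16:
        return 16
    if monitor_colors == 256:
        return 32
    return None  # not covered by the manual


for setting in (2, 16, 256):
    print(setting, "->", color_choice_size(setting))
```
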






Printing 


StatView II will work with any printer, color plotter, and slide maker 
for which Macintosh drivers are available. Printing quality depends 
on the type of printer and driving software. StatView II exploits the 
full capabilities of the ImageWriter, ImageWriter II, ImageWriter 
LQ, LaserWriter and LaserWriter Plus. 


StatView II constructs its tables, text, and graphs in a resolution- 
independent manner; therefore, printouts are generated to the full 
resolution of your printer. 


There are several options available for getting hardcopy output of 
your StatView II color graphs. You can produce color output by 
printing with an ImageWriter II or ImageWriter LQ using a color 
ribbon. Since a Macintosh II can generate more colors than most 
printers, the colors you print will often be an approximation of the 
colors on your screen. Any color plotter or slide making hardware 
that is compatible with the Macintosh II will also produce color plots 
of your graphs. 


Compatibility with StatView and StatView 512+ 


StatView II opens and analyzes data files built by StatView 512+ 
and StatView. StatView II and StatView 512+ do not differ in their 
data analysis organization. 


StatView II and StatView, however, require different data 
organization for several statistics which analyze grouped data. The 
following statistics compare two groups: 


Unpaired t-test 
Mann-Whitney U test 
Kolmogorov-Smirnov test 
Wald-Wolfowitz runs test 


In StatView, one group was placed in an X column and the other 
group was placed in a Y column. In StatView II, all the data is 
placed in one dependent column and a second column is used to 
specify the groups. 
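
To make the difference concrete, the following Python sketch converts the older layout (one group per column) into the layout described for StatView II (one dependent column plus a grouping column). It is an illustration only and not part of either program; the function name `stack_groups`, the group labels, and the sample values are all hypothetical.

```python
def stack_groups(x_column, y_column, labels=("Group 1", "Group 2")):
    """Convert two per-group columns into a dependent and a grouping column."""
    # All measurements go into a single dependent column...
    dependent = list(x_column) + list(y_column)
    # ...and a parallel column records which group each value came from.
    grouping = [labels[0]] * len(x_column) + [labels[1]] * len(y_column)
    return dependent, grouping


# Hypothetical data in the older layout: one group in X, the other in Y.
old_x = [12.1, 9.8, 11.4]
old_y = [14.0, 13.2]

dependent, grouping = stack_groups(old_x, old_y)
for value, group in zip(dependent, grouping):
    print(group, value)
```

Note that the two groups need not be the same size; the grouping column simply repeats each label once per case.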


In addition, all analysis of variance statistics (including the Kruskal- 
Wallis test) have been changed so that groups are specified by 
grouping columns and data appears in dependent columns. 
Information for setting up these statistics is provided in Chapter 5. 
Data generated as ASCII text files from other Macintosh 
applications, as well as from applications on other computers, can be 
directly imported into StatView II. Data from StatView II can also be 
saved as ASCII text files and transferred to other applications. You 
can also use the Clipboard to transfer data to and from other 
Macintosh applications. 


Graphs created by StatView II can be exported to other Macintosh 
applications. You can copy StatView II graphs onto the Clipboard 
and paste them into other applications (such as Microsoft Word, 
MacWrite, Pixel Paint, MacDraw and PageMaker). 


Your analysis results can also be transferred to other applications. 
You can either copy analysis tables as a PICT into the Clipboard and 
transfer them as a picture to other applications, or you can copy the 
values as text to move into spreadsheets, word processors, or other 
analysis packages. 


Details on the transfer of data are provided in Chapter 2, and details on the 
transfer of graphs and analysis results are provided in Chapter 3. 


Chapters 4 and 5 of this manual analyze several sample datasets 
included on your StatView II diskette. 


The discussion of the descriptive statistics and several of the 
comparative statistics uses the Lipid Data dataset. The data were 
provided by Dr. Terence T. Kuske, Professor of Medicine and 
Associate Dean for Curriculum, Medical College of Georgia, 
Augusta, GA. 


The data are blood lipid screenings of medical students at the 
Medical College of Georgia. Blood lipid levels and other 
cardiovascular risk factors (cigarette smoking, hypertension, family 
history of coronary heart disease) are evaluated in students as freshmen and later 
as seniors. This program personalizes education in lipid metabolism, 
prevention of cardiovascular disease, and management of 
hyperlipidemia by diet and medication. 


Lipids include cholesterol and triglycerides and their lipoprotein 
carriers in blood: very-low, low, and high density lipoproteins 
(VLDL, LDL, and HDL). Cardiovascular risk increases in 
proportion to all these parameters except HDL cholesterol, to which 
it is inversely related. This study measures cholesterol, triglycerides 
and HDL cholesterol. A factor allows an estimate of VLDL 
cholesterol from the triglycerides value. Subtraction of VLDL and 
HDL cholesterol from total cholesterol yields a calculated LDL 
cholesterol value. 
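
The arithmetic described above can be sketched in Python as follows. The manual does not state the factor used to estimate VLDL cholesterol from triglycerides; the divisor of 5 used here (VLDL estimated as triglycerides / 5, in mg/dl) is the commonly used Friedewald convention and is an assumption, as are the function name `estimate_ldl` and the sample values.

```python
def estimate_ldl(total_cholesterol, hdl, triglycerides, vldl_divisor=5.0):
    """Estimate LDL cholesterol (mg/dl) by subtraction.

    vldl_divisor is an ASSUMED factor (Friedewald convention); the manual
    only says that "a factor" converts triglycerides to VLDL cholesterol.
    """
    vldl = triglycerides / vldl_divisor   # estimated VLDL cholesterol
    return total_cholesterol - vldl - hdl  # calculated LDL cholesterol


# Hypothetical screening values in mg/dl.
ldl = estimate_ldl(total_cholesterol=200.0, hdl=50.0, triglycerides=150.0)
print(f"Calculated LDL cholesterol: {ldl:.1f} mg/dl")
```
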


“Healthy” values for adults are: 

<200 mg/dl    Total cholesterol 
<140 mg/dl    LDL cholesterol 
>50 mg/dl     HDL cholesterol 
<150 mg/dl    Triglycerides 

Dietary modification, with medication considered, is indicated for 
cholesterol, LDL and triglycerides that exceed 240, 190 and 350 
respectively, especially when other risk factors are present. Exercise 
can raise HDL. 

The discussion on the ANOVA analyzes three datasets from Winer, 
B. J., Statistical Principles in Experimental Design, © 1971, 
McGraw-Hill, New York, New York and one dataset from Afifi, A. and Azen, 
S., Statistical Analysis: A Computer Oriented Approach, © 1979, 
Academic Press, Orlando, Florida. 

The discussion on Factor analysis analyzes the Eight Physical 
Variables data from Harman, Modern Factor Analysis, © 1967, 
University of Chicago Press, Chicago, Illinois. 
Chapter 1 — Quick-Start 


The purpose of this quick start is to show you how quickly and easily 
StatView II produces statistical analyses and how the program lets 
you experiment interactively with analysis results to produce the 
most meaningful view of your data. Take a moment to read the 
Quick-Start at your Macintosh before reading the details about 
StatView II operation. 


Installing StatView II 


StatView II comes on a single 800K diskette. If you are using a hard 
disk system, simply copy the files on the diskette to your hard disk. 
If you are using a floppy-based system, make a copy of the master 
diskette and use that copy in your work. Never use the master disk in 
your work. StatView II is not copy-protected. 


Overview 


StatView II holds data in spreadsheet format. The columns of the 
spreadsheets represent variables (such as age, gender, weight, etc.) 
and the rows represent cases (such as patients). 


By assigning columns as X or Y variables, you direct StatView II to 
use these variables in an analysis. Analyses are chosen from either 
the Compare (comparative statistics) or the Describe (descriptive 
statistics) menus. Often a dialog box follows an analysis selection, 
allowing you to specify particulars regarding the analysis. 


After the dialog box parameters are set and the box is closed, 
analysis results are obtained by choosing a view from the View 
menu. Views are presented in a separate window from the data. Each 
dataset window has an associated view window. 


The following examples, although they touch on only some of 
StatView II's many operational features, demonstrate one of the 
program's most important features: interactivity. Analysis results are 
recalculated in an instant to reflect the following types of interaction: 


• view changes (tables, scattergrams, box plots, pie, bar, and 
line charts) 

• new data selection (selecting different data to include in 
analysis) 

• including or excluding rows from analysis 

• range restrictions within data being analyzed 


Example #1 — Running an Analysis 


This example steps you through the normal methods you use to run 
an analysis. 


Opening a Dataset 


*> Choose Open from the File menu. The standard Macintosh open 
file dialog box appears. 


*> Open the folder titled Sample Data. In that folder is a file titled 
Lipid Data. Choose it, and a window opens on the desktop. 


*> Choose Zoom Up from the Window menu. The dataset, which 
now fills the entire screen, should look like this: 
Select a Column for Analysis 


*> Click once on the column heading Weight. The entire column 
under Weight becomes black (highlighted). 


*> Select Choose X from the Vars menu. A small X1 appears at the 
top of the Weight column under the column title. The data 
window now looks like this: 
Choosing a Statistic 


*> Choose Mean, Std. Dev., etc... from the Describe menu. 


Viewing Analysis Results 


*> Choose Table from the View menu. A window titled View of 
Lipid Data opens on the desktop. It contains this table showing 
the results of your analysis: 


[Table: descriptive statistics for X1: Weight, with columns including Mean, Std. Dev., Std. Error, Variance, and Coef. Var.] 





Running the Analysis on Different Data 


*> Click in the Lipid Data window. It becomes active, covering the 
View window entirely. 


*> Double-click on the column name Cholesterol. This column 
now becomes the X2 column. 


*> Choose View of Lipid Data from the Window menu. The View 
window appears. It is now a standard Macintosh paging window 
with the results for column X1 Weight on the first page and 
column X2 Cholesterol on the second page. 
Running the Analysis on All Data Columns 


*> Choose Quick Assignment from the Vars menu and the 
following dialog box appears: 


[Dialog box: Quick Assignment for Lipid Data, with X Variables, Y Variables, and Unassigned Variables scrolling lists. The variables shown include Gender, Age, Triglycerides, Weight, Cholesterol, HDL, LDL, % ideal body ..., Height, and Skinfold.] 


*> Click on Gender in the Unassigned Variables scrolling list and 
drag the cursor down, highlighting the entire list. 


*> With the list highlighted, click the Choose X button. Now all the 
column names are in the X Variables scrolling list. 


*> Click Done, and the view window becomes a paging window 
containing results of the selected analysis for all the columns in 
the dataset. 


Believe it or not, running StatView II analyses is as simple as that! 
Let's run another test. This time you are going to build the data file. 
First, close the two windows (Lipid Data and View of Lipid Data) by 
clicking in the close box in each window's upper left corner. 


Example #2 — Creating a Dataset 


This example steps you through the normal methods you use to 
create a new dataset. 


Creating a New Data File 


*> Choose New... from the File menu. This dialog box titled New 
Data Column Information opens on the desktop: 


[Dialog box: New Data Column Information, with a Name box, Type radio buttons (integer, real, long, category, string), and Decimal Places radio buttons (0 through 9)] 


*> Type Mean 1. It replaces Column 1 in the Name box. 


*> Click the Type button next to real (since the data you will enter 
has both integers and decimals). Click More. 


*> Type Mean 2. It replaces Column 2 in the Name box. Click the 
button next to real. Click Done. This data file window named 
Untitled-1 opens: 


[Figure: the empty Untitled-1 dataset window] 


Entering Data 


*> Click in the cell beneath Mean 1. 


*> Type the following data, pressing the Return key after each 
number is entered. Type: 2 (Return), 3.4 (Return), 11 (Return), 
23 (Return), 3.41 (Return). The data appears in the column you 
titled Mean 1. If you make a mistake, click in the cell and retype. 


*> Click in the cell beneath Mean 2. 


*> Type the following data, pressing the Return key after each 
number is entered. Type: 2 (Return), 4 (Return), 3.4 (Return), 5.7 
(Return), 11 (Return), 15 (Return), 23 (Return), 3.41 (Return), 
and 4.8 (Return). The data appears in the column you titled Mean 
2. If you make a mistake, click in the cell and retype. 
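
As a cross-check on the values just typed, the same kind of descriptive statistics that the Mean, Std. Dev., etc... choice reports can be computed independently. The Python sketch below is only an illustration: it assumes the sample (n - 1) definitions of variance and standard deviation and a coefficient of variation expressed as a percentage of the mean, which may differ from the formulae StatView II actually uses (those are listed in Appendix B). The function name `describe` is hypothetical.

```python
import math
import statistics

# The two columns entered in this example.
mean_1 = [2, 3.4, 11, 23, 3.41]
mean_2 = [2, 4, 3.4, 5.7, 11, 15, 23, 3.41, 4.8]


def describe(values):
    """Return basic descriptive statistics for one column of data."""
    n = len(values)
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)  # ASSUMED: sample (n - 1) definition
    return {
        "count": n,
        "mean": mean,
        "std_dev": std_dev,
        "std_error": std_dev / math.sqrt(n),
        "variance": statistics.variance(values),
        "coef_var_pct": 100.0 * std_dev / mean,  # ASSUMED: percent of mean
    }


mean_1_summary = describe(mean_1)
mean_2_summary = describe(mean_2)

for name, summary in (("Mean 1", mean_1_summary), ("Mean 2", mean_2_summary)):
    print(name)
    for label, value in summary.items():
        print(f"  {label}: {value:.3f}")
```
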
Assigning Variables 


*> Click once on the title Mean 1 at the top of the first column. The 
column turns black (highlighted). 


*> Select Choose X from the Vars menu. X1 appears under the title 
Mean 1. 


*> Click once on the title Mean 2 at the top of the second column. 
The column turns black (highlighted). 


*> Select Choose X from the Vars menu. X2 appears under the title 
Mean 2. 


Choosing a Statistic 


*> Choose Mean, Std. Dev., etc... from the Describe menu. 


Viewing Results 


*> Choose Table from the View menu, and a window titled View of 
Mean opens with the results of the analysis. 


*> Choose Scattergram from the View menu. 

*> Choose Zoom Up from the Window menu. 


The view window updates, changing the tabular display to an 
error bar display comparing the two columns' means. 


Example #3 — Customizing a Graph 


This example steps you through creating and customizing a chart. 


Creating a Graph 

*> Choose Open from the File menu. 

*> Open the folder titled Sample Data. A file titled Lipid Data is 
found in that folder. Choose it, and a window opens on the 
desktop. 


*> Choose Zoom Up from the Window menu. 


*> Click once on the column heading Weight. The entire column 
under Weight becomes highlighted. 


*> Select Choose X from the Vars menu. A small X1 appears at the 
top of the Weight column under the column title. 


*> Click once on the column heading Cholesterol. The entire 
column under Cholesterol becomes highlighted. 


*> Select Choose Y from the Vars menu. A small Y1 appears at the 
top of the Cholesterol column under the column title. 


*> Choose Scattergram from the View menu, and this window 
titled View of Lipid Data opens, displaying a scattergram of the 
Cholesterol values plotted against the Weight values: 


[Figure: Scattergram for columns: X1Y1, plotting Cholesterol against Weight] 





Customizing the Graph 


*> Choose Zoom Up from the Window menu. 

*> Select the X axis by clicking on it. 


*> Select the Y axis as well by holding down the shift key and 
clicking on it: 


[Figure: the scattergram with both axes selected] 


*> Choose Font from the Text menu and a menu pops out from the 
side. Choose Chicago as the font: 
[Figure: the scattergram with its axis text set in the Chicago font] 


*> Select the Square drawing tool from the palette on the left of the 
graph. The cursor will change to a cross hair: 


+ 


*> Click the mouse on the graph and drag to draw a square: 


[Figure: a square drawn over the data points of the scattergram] 


*> Click the plot symbol in the legend. 


*> Choose Point Type from the Graph menu and select the X 
point. The graph is redrawn with a new plot symbol: 


[Figure: Scattergram for columns: X1Y1, redrawn with the X plot symbol; the X axis is Weight] 


*> Choose Color from the Graph menu. 


If you are running on a non-color system you will see a list of 
eight colors. Choose Red. The graph will be redrawn still 
showing the data points as black; nevertheless, they will print as 
red on a color output device. 


If you are running on a color system you will see a palette of 
colors. Choose a new color for the data points. The graph will be 
redrawn with a new color for these points. 


Experiment now with changing the attributes of other items on the 
graph. You can assign a color to any object on the screen. Try 
drawing different shapes on your graph. You can see how easy it is 
to customize your graph using StatView II. 


Example #4 — Transforming Columns 


The following steps demonstrate some of the data handling features 
of StatView II. 


*> Zoom Untitled-1 to full screen size by choosing Untitled-1 
from the Window menu, and then choosing Zoom Up after it 
(Untitled-1) is checked. 


*> Choose Transform from the Tools menu, and this dialog box 
appears: 


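[Dialog box: Transform, with a Name box, a Select Column scrolling list, a Transformation selection list (including 1/x, x^2, Sqrt, ln(x), and ln(1+x)), and Decimal Places radio buttons (0 through 9)] 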





Notice that there are two column names, Mean 1 and Mean 2, in 
the Select Column scrolling list (because there are two columns 
in the dataset Untitled-1). 


The text entry box labelled Name at the top of the dialog box 
displays the default name 1/x of Mean 1 because the current 
transformation selection calls for the new column to be the 
function 1/x of the column titled Mean 1. 


*> Click on x^2 in the Transformation selection list, and the 
default name for the column to be created by the transformation 
changes to x^2 of Mean 1. 


*> Click the button labelled Transform and a new column titled 
x^2 of Mean 1 is added to the dataset. The column also appears 
in the Select Column list. 


*> Click the Exit button, and the dialog box closes. 


The window Untitled-1 now has three data columns, Mean 1, 
Mean 2, and x^2 of Mean 1. 


*> Choose Formula... from the Tools menu, and the following 
dialog box appears: 


[Dialog box: Formula, with Operand 1 and Operand 2 scrolling lists (each listing Mean 1, Mean 2, and x^2 of Mean 1), a function list, operator radio buttons, a Name box, and Decimal Places radio buttons (0 through 9)] 


*> Click on Mean 1 in the Operand 1 list and it becomes 
highlighted. Click on Sqrt in the function list at the top center of 
the dialog box, and then click the button of Op1. Notice that the 
label above the Operand 1 list now reads Sqrt of with the 
column Mean 1 highlighted below it. 


*> Click on Mean 2 in the Operand 2 list and it becomes 
highlighted. 


*> Choose the multiplication (*) radio button between the two 
operand selection lists. 


*> Change the name of the column to be created from Column 4 to 
Sqrt Mean 1 * Mean 2. 


*> Click Calculate. The column just created is added to the dataset 
and appears in both the Operand 1 and Operand 2 lists. 


*> Click Exit and the dataset Untitled-1 appears again with the new 
column added. 
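
The two derived columns built in this example can be reproduced with a short Python sketch, which may help you check the values StatView II displays. It is an illustration only. It assumes that the transformation chosen above is x squared, and it assumes that a row with no value in either operand yields a missing value (represented here by None); the manual does not describe how such rows are handled, since Mean 1 has five values and Mean 2 has nine. The function name `sqrt_times` is hypothetical.

```python
import math

# The two columns entered in Example #2.
mean_1 = [2, 3.4, 11, 23, 3.41]
mean_2 = [2, 4, 3.4, 5.7, 11, 15, 23, 3.41, 4.8]

# Transform: ASSUMED to be "x squared of Mean 1".
squared_mean_1 = [x ** 2 for x in mean_1]


def sqrt_times(operand_1, operand_2):
    """Compute Sqrt(operand_1) * operand_2 row by row.

    ASSUMPTION: rows where either operand has no value give None.
    """
    rows = max(len(operand_1), len(operand_2))
    result = []
    for i in range(rows):
        if i < len(operand_1) and i < len(operand_2):
            result.append(math.sqrt(operand_1[i]) * operand_2[i])
        else:
            result.append(None)  # missing value
    return result


sqrt_mean_1_times_mean_2 = sqrt_times(mean_1, mean_2)

print("x^2 of Mean 1:", squared_mean_1)
print("Sqrt Mean 1 * Mean 2:", sqrt_mean_1_times_mean_2)
```
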


Experiment with the Formula and the Transformation dialog boxes if 
you wish. Please do not save any changes made to the file Lipid Data 
since that file will be used for demonstration purposes in this 
documentation. 


Please close all windows before starting Chapter 2. 


Chapter 2 — Learning StatView II 


Introduction 


This chapter is an extended tutorial that should be read through at the 
computer for optimum value. Most of the operational features of the 
program are covered here. Specifically, after reading this chapter, 
you will know how to: 

• open and create a StatView II dataset 

• resize dataset windows and columns 

• enter and edit data in a dataset 

• graph data 

• assign variables and run analyses on data 

• view the results of analyses 

• customize analysis views 

• include and exclude cases from analyses 

• set and edit ranges in analyzed data 

• save datasets and analysis results 

• move datasets and results to other applications 

• print datasets and analysis results 

The information in this chapter is of a general nature and, as such, 
should be understood regardless of the analysis you will be doing. 
Chapter 3 goes into more detail on how to create and modify charts. 
Chapters 4 and 5 provide specific information for each analysis, and 
can be used as a reference source. Chapter 6 provides information on 
data transformations and handling. 
The Data Window 


Follow the steps below to open a sample dataset contained on the 
StatView II disk. 


Opening Datasets 


*> Choose Open from the File menu, and the standard Macintosh 
open-file dialog box appears. 


*> Click on the Drive button to locate the disk where the Sample 
Data resides. 


*> Open the folder titled Sample Data. In that folder is a file titled 
Lipid Data. Choose it, and a window opens on the desktop. 


Dataset Mechanics 


*> Choose Zoom Up from the Window menu, and the dataset fills 
the entire screen. 


The Window 


StatView II datasets are standard Macintosh windows: 





[Figure: the Lipid Data dataset window, showing row numbers and 
columns including Name, Age, Weight, and Cholesterol.] 


The scroll bars on the right side and bottom of the window allow you 
to scroll through the window. The grow box in the bottom right 
corner changes the window's size. The close box in the upper left 
corner closes the window. 
Changing Window Size 


There are four ways to change the size of StatView II windows. 


The grow box in the bottom right corner of dataset windows 
operates in the standard Macintosh manner. Clicking on this 
box and dragging either enlarges or shrinks the window. 


The toggling menu choice Zoom Up/Down on the Window 
menu enlarges or shrinks the active window. If the active 
window is enlarged, the menu choice is Zoom Down. If the 
active window is less than full screen size, the menu choice is 
Zoom Up. 


If Zoom Down is chosen, the window contracts to the size 
and screen location it held before it was expanded. 


If Zoom Up is chosen and the active window is a dataset, the 
window expands to full screen size. If Zoom Up is chosen 
and the active window is a graphic view window, the window 
expands to either full screen size or to the maximum width 
that allows it to be pasted into a MacWrite document. 


On a small-screen Macintosh, these sizes are the same. On a 
large-screen Macintosh, they are different. In this case, 
determine the size to which active view windows are zoomed 
up, by choosing Preferences from the Tools menu. Radio 
buttons labelled View Zoom Preference in the bottom left 
side of this dialog box allow selection of either full screen 
size or MacWrite size. 


Double-clicking on the title bar of a window has the same 
effect as choosing Zoom Up/Down. 


Clicking in the Zoom box on the right side of the title bar has 
the same effect as choosing Zoom Up/Down. 


Changing Window Position 


The position of any window can be changed by clicking on the 
window's title bar and dragging the window to a new location. 


Multiple Windows 


Eight datasets can be open on the desktop at the same time. Only one 
dataset can be active (have analyses or data entry/editing performed 
on it) at any one time. 


Each dataset may have an associated view window open. Thus, as 
many as 16 StatView II windows may be open at once. 


Activating Windows 


If you have eight datasets and eight view windows open, it can be 
difficult to find a particular window. There are several ways to 
activate (bring to the front) windows. 


* Windows can be activated by clicking in them. 


* Windows can be activated by choosing them from the 
Window menu. Each window (dataset or view) that is open 
on the desktop is listed on this menu. The active window is 
checked. 


* The view window associated with an active dataset can be 
activated by double-clicking in the blank square in the upper 
left corner of the dataset. (This is the square to the left of the 
first column name and above row number 1, not the 
close-window box.) 


If there is no view window open for a dataset, double-clicking 
in this square opens a table view of the selected analysis. 


Dataset Structure 


StatView II holds data in spreadsheet format. The columns in a 
dataset define variables and the rows define cases. 


At the top of each column are two rectangles, one above the other. 
The top rectangle contains the user assigned name of the column. 
The smaller rectangle below this contains variable assignments for 
the column. (This rectangle is blank in all columns in Lipid Data 
now. Variable assignment is discussed below.) 


Resizing Dataset Columns’ Width 

Data columns can be made wider or narrower. 

*> Position the cursor to the right of a column name on the vertical 
line that separates columns. The cursor becomes a cross with 
arrows on its horizontal bar. 

*> Click on the line and drag it to alter the width of the column. 

If dataset columns are too narrow to display the data entered, pound 
signs (###) appear in the cells in place of numeric data, and 
ellipses (...) appear in place of undisplayed alpha data. 
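The display rule above can be sketched in Python. This is only an illustration under stated assumptions, not StatView II's actual code: the function name display_cell, the idea of measuring width in characters, and filling the whole cell with pound signs are all hypothetical.

```python
def display_cell(value, width):
    """Return the text shown in a cell that is `width` characters wide.

    Assumption: width is counted in characters; the real program
    works with pixel widths on screen.
    """
    text = str(value)
    if len(text) <= width:
        return text
    if isinstance(value, (int, float)):
        # Numeric data that does not fit is replaced by pound signs.
        return "#" * width
    # Alpha data is cut short and ends in an ellipsis (three periods).
    return text[: max(width - 3, 0)] + "..."


print(display_cell(29, 5))          # fits, shown unchanged
print(display_cell(12345, 3))       # numeric, too wide
print(display_cell("J. Smith", 6))  # alpha, too wide
```

Widening the column (a larger width here) restores the full value, because only the display changes and the stored data is untouched.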
Creating Datasets 


Before data can be entered in StatView II via the keyboard or the 
paste command, a dataset must be created to hold it. Datasets are 
created using the New choice on the File menu. 


*> Choose New from the File menu and this dialog box titled New 
Data Column Information appears over Lipid Data: 


[Dialog box: New Data Column Information, with a Name entry 
rectangle, Type radio buttons (integer, real, long, category, and 
string), and Decimal Places radio buttons.] 





The dialog box allows you to define columns for a new dataset. You 
specify the name of the column; the type of data to be entered in the 
column; and the number of decimal places displayed in the column. 


Naming Columns 


The text entry rectangle labelled Name allows you to name the 
column being created. Column names can be a maximum of 37 
characters long, with no two columns having the same name. 


Since we are defining columns for a new dataset, this rectangle 
contains the highlighted default name Column 1. Whatever you type 
replaces this as the name for the first column. 


Column names can be changed at any time using the Format choice 
on the Tools menu. 


*> Name this column Treatment. 


Assigning Column Types 


Below the Name entry box is an array of radio buttons denoting data 
Type. These include: integer, real, long, category, and string. 
Click on the button that describes the data (see below) that will be 
entered in the column being defined. 
Integer Data 


Integer data comprises whole numbers between 32,767 and -32,767. 


Long Integer Data 


Long integer data comprises whole numbers between 2,147,483,648 
and -2,147,483,648. 


Real Data 


Real data has fractional parts. A real number may range from 
-1.1E4932 to 1.1E4932, with a smallest positive number of 
1.9E-4951 and a largest negative number of -1.9E-4951. When you 
choose real data, specify the number of decimal places StatView II 
uses to display the data with the radio buttons labelled Decimal 
Places. 


The program stores and manipulates data to approximately 18 
decimal places even though 9 is the maximum displayed in a cell. 
Displayed decimal places can be changed at any time using the 
Format choice in the Tools menu. 
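As a rough illustration of how the integer and long limits and the displayed decimal places behave, here is a minimal Python sketch. The names INTEGER_RANGE, LONG_RANGE, fits, and display_real are hypothetical, the limits are simply the ones quoted above (assumed to be inclusive), and Python floats carry fewer digits than the roughly 18 decimal places StatView II stores.

```python
# Limits as quoted in the text above; assumed to be inclusive.
INTEGER_RANGE = (-32_767, 32_767)
LONG_RANGE = (-2_147_483_648, 2_147_483_648)


def fits(value, limits):
    """True if a whole number lies inside the given (low, high) limits."""
    low, high = limits
    return low <= value <= high


def display_real(value, places):
    """Format a real value with a fixed number of displayed decimal places.

    Only the display is affected; the stored value keeps its precision.
    """
    return f"{value:.{places}f}"


print(fits(29, INTEGER_RANGE))      # an age fits an integer column
print(fits(100000, INTEGER_RANGE))  # too large for an integer column
print(fits(100000, LONG_RANGE))     # fits a long column
print(display_real(149.5, 2))       # shown with two decimal places
```

This is why a value such as 100000 calls for a long column rather than an integer column.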


String Data 


String data is alpha-numeric data entered in cells to document the 
dataset. String columns cannot be assigned variables and, therefore, 
cannot be used in analyses. Cell entries for string data cannot exceed 
80 characters. (The column Name in Lipid Data is a String column.) 


Category Data 

Category data is alpha-numeric data that can be used in analyses to 
group data. Category data entries for each column are defined in a 
set. Only members of the category set selected or defined for a 
column can be entered in the column. 


*> Select category as the data type for the column we have named 
Treatment, and click More. This dialog box appears: 


[Dialog box: Please choose the new column's category, with a 
scrolling list of category sets and the buttons Select, New, File, 
and Cancel.] 


On the right side of the dialog box are the buttons File and 
Cancel. Cancel returns you to the New Data Column 
Information dialog box. Clicking File rotates a display of the 
category sets already defined in: 


* the dataset being defined (none have been defined so far in 
this case). Notice the name of the dataset Untitled-1 is 
displayed above the File button and that the scrolling 
selection list for category sets is empty. 


* the StatView II Set Library (there is one defined already 
titled Default). The name Library is displayed above the 
File button when the Library sets are displayed. 


* any other dataset open on the desktop. In this case, Lipid 
Data is open. It has four category sets: Sex, Smoking, 
Alcohol Frequency, and Heart Attack. 


To the left of the File and Cancel buttons are the buttons Select 
and New. Select is available when a set is highlighted in the 
scrolling list to the left. Clicking Select chooses the highlighted 
set for the category column and returns you to the New Data 
Column Information dialog box. The New button allows the 
definition of a new category set. 


Click New and the following dialog box appears: 











[Dialog box: Create a New Category, with Category Name and 
Element Name entry rectangles; Add, Replace, and Delete buttons; 
a scrolling list of elements; and Done and File buttons.] 





The top text entry rectangle is labelled Category Name and has 
the default entry Category 1. The name you enter here is the 
name that is displayed in the previous dialog box (for the dataset 
Lipid Data the names of the category sets were Sex, Smoking, 
Alcohol Frequency, and Heart Attack). The maximum size for a 
Category set name is 20 characters. 


*> Enter Dosage as the category name, and press the Tab key. 


The text entry rectangle below Category Name is labelled 
Element Name. Names entered here are the specific alpha-numeric 
entries that can be displayed in dataset cells. Element 
names cannot exceed 20 characters. 


*> Enter Low and click the button labelled Add (or press Return). 


"Low" appears in the scrolling list to the left. Each element you 
add to the set appears here. 


*> Click on Low in the scrolling list to the left, and it becomes 
highlighted. 


Notice that the buttons next to Add labelled Replace and Delete 
change from gray (non-selectable) to black (selectable) when an 
element is highlighted in the scrolling list. Any text entered in 
the Element Name rectangle replaces Low if Replace is clicked 
when Low is highlighted. Low is deleted if Delete is clicked 
when Low is highlighted. 


*> Enter Medium for Element Name 2. Click Add or press Return. 


*> Enter High for Element Name 3. Click Add or press Return. 


The File button at the bottom right of the dialog box is used to 
save the set in either the dataset for which it is being created 
(Untitled-2) or in the StatView II Library. The name above the 
File button changes from the name of the dataset (Untitled-1) to 
Library with each click of the File button. 


Note: If a set is saved in the Library, it can be selected for use in 
any dataset. If it is saved with a specific dataset, that dataset must 
be open for the category set to be used to define columns. 
(Remember, because Lipid Data is open, the four sets from Lipid 
Data were available when we defined this category set.) Sets 
saved in the StatView Library cannot be edited. 


*> Use the File button to choose the dataset named (Untitled-2) as 
the place to save the set Dosage, and click Done. 


The set for column one (type category) is now defined. StatView II 
returns to the New Data Column Information dialog box with the 
default name Column 2 in the Column Name text entry rectangle. 


Notice that the Category set definition dialog box can be used 
without the mouse by pressing the Tab key to advance the cursor and 
the Return key to close the box. 


Editing category sets is discussed below (after the dataset being 
defined is completed). 


Category Data as Ordinal Numbers (Integers) 


Each element in a category set is assigned an ordinal number 
(beginning with one) in the order it was added to the category set. If 
data from a category set is pasted into an integer, long integer, or real 
column, the ordinal number of the element, rather than the element 
name, is pasted. 
You can assign an X or Y to a category variable and use it in a data 
analysis. 
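The relationship between category elements and their ordinal numbers can be pictured with a small Python sketch. The class CategorySet and its methods are hypothetical names chosen for illustration; only the rules themselves (ordinals start at one and follow the order of addition, and names are limited to 20 characters) come from the text above.

```python
class CategorySet:
    """A named, ordered set of category elements (illustrative only)."""

    MAX_NAME_LENGTH = 20  # limit stated for set names and element names

    def __init__(self, name):
        self._check(name)
        self.name = name
        self.elements = []

    def _check(self, text):
        if len(text) > self.MAX_NAME_LENGTH:
            raise ValueError(f"{text!r} is longer than 20 characters")

    def add(self, element):
        """Add an element and return its ordinal number (1-based)."""
        self._check(element)
        self.elements.append(element)
        return len(self.elements)

    def ordinal(self, element):
        """Ordinal pasted when this element goes into a numeric column."""
        return self.elements.index(element) + 1

    def element(self, ordinal):
        """Element name that belongs to a given ordinal number."""
        return self.elements[ordinal - 1]


dosage = CategorySet("Dosage")
for name in ("Low", "Medium", "High"):
    dosage.add(name)

print(dosage.ordinal("Medium"))  # ordinal of the second element added
print(dosage.element(3))         # element that was added third
```

Because the ordinal depends only on the order of addition, pasting Medium into an integer column yields its position in the set rather than its name.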


Changing Data Type 


Data type can be varied for each column in a dataset. Once the type 
is selected for a column it cannot be changed, although data from a 
column can be transferred (with the Cut and Paste command) to a 
column of a different type. (See the section on pasting in this chapter 
for details. ) 


The More and Done Buttons 


At the bottom of the New Data Column Information dialog box are 
two buttons: More and Done. More creates a new column using the 
parameters in the dialog box, and then changes the default name in 
the Column Name rectangle allowing the definition of another 
column. 


Done creates a new column using the parameters in the dialog box, 
and then closes the New Data Column Information dialog box and 
opens the newly created file on the desktop. The new file is a dataset 
window (like Lipid Data) containing column names and an input 
row. 


*> Change the name of Column 2 to Age. Select data type integer. 
Click More. 


*> Change the name of Column 3 to Weight. Select data type real. 
Select 2 for the number of decimal places displayed. Click More. 


*> Change the name of Column 4 to Result. Select data type long. 
Click More. 


*> Change the name of Column 5 to Name. Select data type string. 
Click Done. 


A new dataset opens over Lipid Data. 


Editing Category Sets 


Category sets are edited through the Edit Categories choice on the 
Tools menu. Using the edit function, the names of category sets and 
the names of elements in category sets can be changed. Elements 
cannot be added to or deleted from category sets. The dataset that 
contains the category set to be edited must be active when the 
selection Edit Categories is made. 


Category sets saved in the StatView II Library are not editable. If 
sets in the StatView II Library contain mistakes, the Library icon can 
be thrown out (trashed). All sets in the old Library are lost; however, 
no data or category sets in datasets that were defined using the 
discarded Library are affected. When you define a new Library set, a 
new Library icon is generated. 


Note: throwing out your StatView Library will also throw away any 
changes you have made to the Color Palette or the Preferences. 


*> Activate Lipid Data. 


*> Choose Edit Categories from the Tools menu, and this dialog 
box appears: 


[Dialog box: Choose which category you wish to edit, with a 
scrolling list of the category sets Sex, Smoking, Alcohol Frequency, 
and Heart Attack.] 





The names in the scrolling list represent all the category sets in 
the active window. The set that is highlighted when Edit is 
clicked becomes available for editing. 


*> Highlight Sex and click Edit. This appears: 


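[Dialog box: Edit a Category in Lipid Data, with Category Name and 
Element Name entry rectangles, a Replace button, a scrolling list 
containing the elements male and female, and Cancel and Done 
buttons.] 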





The name of the category set appears in the text entry rectangle 
labelled Category Name. Any name that is in this box when 
Done is clicked becomes the new name for the category set. 


The scrolling list contains all the elements in the category set 
being edited. The highlighted element appears in the text entry 
rectangle labelled Element Name. The element name that is in 
this rectangle replaces the highlighted name when Replace is 
clicked. 


The names in the scrolling list when Done is clicked become the 
names for the category set. Clicking Done returns the program to 
the category selection dialog box. Here, a new category can be 
selected for editing, or Done can be selected to return to the 
dataset. 


Experiment with changing category elements or return to the dataset 
as you wish. Do not save changes made to Lipid Data. 


Entering and Editing Data 


Data is entered in StatView II datasets in three ways: 


* Data can be typed directly into StatView II cells using the 
keyboard or keypad. 


* Data can be pasted into StatView II from other Macintosh 
applications through the Clipboard. Entire rows, entire 
columns, or blocks of contiguous cells covering multiple 
rows and columns can be entered this way. 


* Data can be imported into StatView II if it is in text file 
format. Text files can come from other Macintosh 
applications or from other computers (micro, mini, or 
mainframe). 


StatView II data can be edited by individual cell, by entire 
row, by entire column, or by selected contiguous cells covering 
multiple rows and columns. 


The standard Macintosh protocols for cutting, copying, and pasting 
a highlighted selection are followed. 


The Input Row 


*> Activate the empty dataset you have just created. It should look 
like this: 


[Figure: the new dataset window, containing the column names and 
a gray input row.] 


The only row present in the new dataset is the gray input row. This 
row is always the bottom row of a dataset regardless of the number 
of rows the dataset contains. This row always has as many cells as 
the dataset has columns. 


The input row is where new data is entered in a dataset. 


Entering Data from the Keyboard or Keypad 


*> Position the cursor over the cell in the input row in the Age 
column, and the cursor turns into a cross. 





*> Click in the cell when the cursor is a cross and the cell changes 
from gray to black. 





*> Type 29 and press the Tab key. The following happen: 

* row 1 appears, and the input row moves down. 

* 29 is entered in row 1, column Age. 

* all cells in row 1 except the cell in the string column and the 
cell you entered data in contain missing value symbols 
(periods). 

* the highlighted cell moves to row 1, column Weight. 


*> With the cell in row 1, column Weight highlighted, type 149.5 
and press the Tab key. Notice that since we selected 2 decimal 
places for display, the data entered reads 149.50. 


*> With the cell in row 1, column Result highlighted, type 100000 
and press the Tab key. 


*> Since Name is a string column, the cell does not highlight; rather, 
a text entry cursor appears. Type J. Smith and press the Tab key 
and the cell in the Treatment column of the input row becomes 
highlighted. 


We have defined the elements of the category set for the Treatment 
column as Low, Medium, and High. There are two ways to enter 
category data: 


1. Typing the first letter of a category element causes that 
element to be entered in a cell if that letter is a unique initial 
letter in the category set. If it is not unique, type as many 
letters as necessary to complete the element name. 


2. Typing the ordinal number for the category element causes 
that element to be entered in the cell. 


*> Type M or 2 and the category set element Medium appears in 
row 2 of the Treatment column. 
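The two entry methods can be sketched as a small lookup in Python. This is a hypothetical illustration, not the program's own logic: the function name resolve_entry is invented, and ignoring upper and lower case when matching letters is an assumption.

```python
def resolve_entry(typed, elements):
    """Return the category element selected by `typed`, or None.

    `elements` is the category set in the order it was defined.
    Assumption: letter matching ignores upper and lower case.
    """
    if typed.isdigit():
        # Method 2: the ordinal number of the element (1-based).
        ordinal = int(typed)
        if 1 <= ordinal <= len(elements):
            return elements[ordinal - 1]
        return None
    # Method 1: enough leading letters to identify exactly one element.
    matches = [e for e in elements if e.lower().startswith(typed.lower())]
    return matches[0] if len(matches) == 1 else None


dosage = ["Low", "Medium", "High"]
print(resolve_entry("M", dosage))  # a unique initial letter
print(resolve_entry("2", dosage))  # the ordinal number
print(resolve_entry("M", ["Male", "Medium"]))   # ambiguous, no match yet
print(resolve_entry("Me", ["Male", "Medium"]))  # more letters resolve it
```

The last two calls show why extra letters are needed when two elements share an initial letter.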


Entering Missing Values 

StatView II allows the entry of missing values in datasets. Click in a 
cell highlighting it and type a period. When you move to a new cell, 
the period becomes a large dot indicating a missing value. Missing 
values are treated differently by different analyses. The manner in 
which missing values are treated is specified in this chapter in the 
section The Mechanics of Running Analyses. 

Cursor Movement 


The following keystrokes control cursor movement in a dataset: 


[Table: Keystroke and Direction of Selection Movement.] 













The arrow keys move the selection one cell. Holding down the 
Command key while pressing an arrow key scrolls a screen a page at 
a time without moving the selection. Thus, Command-Down would 
show you the next page of your data, but would not move the 
selection. 


If you have an extended keyboard (with Page Up and Page Down 
keys), you can also page through the dataset without moving the 
selection. 


[Table: Keystroke and Direction of Page Movement.] 



















Preferences for Enter Key Cursor Movement 


*> Choose Preferences from the Tools menu, and this dialog box 
appears: 


[Dialog box: Preferences, with radio buttons for Number of Decimal 
Places for Results (0 through 9), View Zoom Preference (full screen 
or MacWrite size), and Enter Key moves (right, down, or doesn't 
move).] 





The bottom right area of the box contains a radio button panel 
which determines the way the cursor moves in a dataset when the 
Enter key is pressed. The selections allow the cursor to move to 
the right (like Tab); down (like Return); or not at all. 


Pressing Shift-Enter moves the cursor in the reverse direction of 
whatever selection is made in the Preferences dialog box. 
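The Enter key behaviour amounts to a small lookup of row and column offsets. The Python sketch below is only an illustration: the names MOVES and enter_key_target are hypothetical, cells are addressed as (row, column) pairs, and the edges of the dataset are ignored.

```python
# Offsets (rows, columns) for each Enter Key preference.
MOVES = {
    "right": (0, 1),         # like Tab
    "down": (1, 0),          # like Return
    "doesn't move": (0, 0),
}


def enter_key_target(row, col, preference, shift=False):
    """Return the cell selected after Enter (or Shift-Enter) is pressed."""
    d_row, d_col = MOVES[preference]
    if shift:
        # Shift-Enter reverses whatever direction is set.
        d_row, d_col = -d_row, -d_col
    return row + d_row, col + d_col


print(enter_key_target(3, 2, "down"))               # one row down
print(enter_key_target(3, 2, "down", shift=True))   # one row up
print(enter_key_target(3, 2, "right", shift=True))  # one column left
```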


*> Click Cancel to return to the dataset. 


Selecting Data 


Data that is copied or cut from a StatView II dataset or from another 
Macintosh spreadsheet can be pasted into a StatView II dataset. The 
editing operations of cut, copy, paste, and clear always operate on 
selected (highlighted) cells in the data window. 


To highlight cells, click and drag the cursor over the cells to be 
highlighted. Dragging diagonally highlights blocks of contiguous 
cells. If you are selecting a large block of contiguous cells, click in 
the cell in any corner of the block (it becomes highlighted); use the 
scroll bars to locate the cell in the corner of the block diagonally 
across from the selected cell, hold down the Shift key and click in 
the second corner cell. The entire block becomes selected. 
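The Shift-click technique selects the rectangle spanned by two opposite corners, whichever corner is clicked first. A minimal Python sketch follows; the function name block_between and the (row, column) addressing are assumptions made for illustration.

```python
def block_between(corner_a, corner_b):
    """Return every (row, column) cell in the block between two corners.

    The corners may be given in either order, as with clicking one
    corner and Shift-clicking the diagonally opposite one.
    """
    (r1, c1), (r2, c2) = corner_a, corner_b
    rows = range(min(r1, r2), max(r1, r2) + 1)
    cols = range(min(c1, c2), max(c1, c2) + 1)
    return [(r, c) for r in rows for c in cols]


cells = block_between((1, 1), (4, 3))
print(len(cells))            # number of cells in a 4 row by 3 column block
print(cells[0], cells[-1])   # first and last cell of the block
```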


Selecting Entire Rows or Columns 


Clicking once on a row number selects the entire row. (Be careful 
not to click twice on a row number if you wish to select it. This will 
gray out the row number meaning the row is excluded from analysis. 
Details below.) 


Clicking once on a column title or in the variable assignment 
rectangle below the column title selects the entire column. (Clicking 
twice in these areas causes column assignments to be made or 
unmade.) 


Multiple rows or columns may be selected by clicking on a row 
number or column head and dragging the cursor over adjoining row 
numbers or column heads. 


Large blocks of rows or columns can be selected by selecting one 
row or column, scrolling to the row or column at the end of the 
group to be selected, holding down the Shift key and selecting the 
row or column at the end of the group. All rows or columns between 
the two selected are highlighted. 


Selecting Data Using the Edit Menu 


The Select All Columns and Select All Rows choices are available 
on the Edit menu. 


Select All Columns is useful either to assign all columns in a 
dataset as a single variable (in conjunction with the Choose X or the 
Choose Y choices on the Variables menu) or to clear all columns of 
variable assignments (in conjunction with the Clear X&Y choice on 
the Variables menu). 


The Quick Assignment selection on the Variables menu (discussed 
below) is a better choice for assigning variables if less than all the 
columns in a dataset are going to have their assignments altered. 


Select All Rows (in conjunction with the choice Include Row from 
the Variables menu) is useful to include rows that had previously been 
excluded from analysis. (Inclusion and exclusion of data from 
analyses is discussed below.) 


Select All Rows or Select All Columns is the easiest way to select 
an entire dataset for copying to the Clipboard. 


Show Selection 


The choice Show Selection from the Edit menu automatically 
scrolls the dataset to display any section of it that is highlighted. The 
cell in the upper left corner of the selected block is positioned in the 
upper left corner of the window. 


*> Activate Lipid Data and experiment with these ways of selecting 
data if you wish. Do not close Untitled-2. 


Copying Data 


Data that is selected is copied to the Clipboard by choosing Copy 
from the Edit menu. 


Numeric data in the Clipboard is converted to text whenever a desk 
accessory is activated or a new application is switched to in 
MultiFinder. While conversion is taking place, the yin-yang cursor 
is displayed. Conversion can be cancelled by pressing Command-. 
(Command-period) while this cursor is displayed. If conversion is 
cancelled, no data is available in the Clipboard for pasting to a desk 
accessory or another application. 


When data is converted from numeric to text form, it loses all 
decimal places not displayed. If you are pasting data outside of 
StatView II, make sure enough decimal places are displayed in the 
copied cells to fulfill your need. 
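The loss of undisplayed decimal places can be demonstrated with a short Python sketch. The function name to_clipboard_text is hypothetical, and the example assumes the displayed value is rounded; whether StatView II rounds or truncates is not stated here.

```python
def to_clipboard_text(value, displayed_places):
    """Convert a number to text using only the displayed decimal places.

    Assumption: the value is rounded to the displayed places.
    """
    return f"{value:.{displayed_places}f}"


stored = 149.4567                    # value held in the dataset
text = to_clipboard_text(stored, 2)  # what another application receives
recovered = float(text)              # value read back from that text

print(text)
print(recovered == stored)           # the extra decimal places are gone
```

Increasing the number of displayed decimal places before copying keeps more of the stored value in the converted text.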


Pasting Data into StatView II 


StatView II pastes data from the Clipboard into the selected area of 
the dataset. Unlike spreadsheets, pasting data into StatView II does 
not shift any of the current contents of the dataset. Thus, if the 
Clipboard contains an array of numbers that is 3 columns wide and 3 
rows deep, and the selected area is 2 cells wide and 2 cells deep, 
paste does not shift cells to the right of the selected area over, nor the 
cells underneath the selected area down. 
There are three elements to consider when pasting data: 

* The size of the selected area 

* The type of data being pasted vs. the type of the selection area 

* The location of the selection area (input row vs. body of dataset) 


Size of Selected Area 


The selected area can have several shapes in relation to the size of 
the data in the Clipboard. The selected area may be smaller than the 
data in the Clipboard, the exact size as the data, or larger than the 
data. Each situation results in a different paste. 


* If the selected area is smaller than the data in the Clipboard, 
StatView II pastes as much data as it can starting in the 
upper left cell. 


* If the selected area is the exact size of the data in the Clipboard, 
an exact copy of the data is pasted. 


* If the selected area is larger than the data in the Clipboard, 
StatView II takes one of two actions depending on the size of the 
area. If the size of the selected area is an exact multiple of the 
data, StatView II duplicates the data as many times as necessary 
to fill the selected area. If it is not an exact multiple, only one 
copy of the data is pasted in and the remaining cells in the 
selected area are filled with missing value symbols. 


This feature is especially useful in coding data. If, for 
example, you want to create a coding column of sevens 
parallel to the first twenty cells of another column, enter 
seven in the first cell of the coding column, copy it, highlight 
the next nineteen cells in that column, and select paste. 
Sevens fill the selected area. 
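The three size cases can be summarized in a Python sketch. It is an illustration under assumptions, not the program's code: the name paste is hypothetical, "exact multiple" is taken to mean that both the row count and the column count of the selection are multiples of the Clipboard's, and a period stands in for the missing value symbol.

```python
MISSING = "."  # stand-in for the missing value symbol


def paste(clip, sel_rows, sel_cols):
    """Return the contents of a selected area after pasting `clip` into it.

    `clip` is a list of equal-length rows. Assumption: the selection is
    an exact multiple only when both dimensions divide evenly.
    """
    clip_rows, clip_cols = len(clip), len(clip[0])
    exact_multiple = (sel_rows % clip_rows == 0
                      and sel_cols % clip_cols == 0)
    result = []
    for r in range(sel_rows):
        row = []
        for c in range(sel_cols):
            if exact_multiple:
                # Same size or a multiple: repeat the data to fill the area.
                row.append(clip[r % clip_rows][c % clip_cols])
            elif r < clip_rows and c < clip_cols:
                # Smaller or uneven area: one copy from the upper left cell.
                row.append(clip[r][c])
            else:
                # Cells beyond the single copy get missing values.
                row.append(MISSING)
        result.append(row)
    return result


grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(paste(grid, 2, 2))      # selection smaller than the Clipboard
print(paste([[7]], 19, 1))    # coding column: one seven fills 19 cells
print(paste([[1, 2]], 1, 3))  # not an exact multiple
```

The coding example from the text corresponds to the second call: a single copied cell is an exact divisor of any selection, so it fills the whole area.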


Type of Data Pasted vs. Type of Selected Area 


The following describes the results of pasting various types of data 
(integer, long, real, category, and string) into parts of a dataset. 


Integer, real, or long data pasted into: 


* an integer, real, or long area is converted to the destination 
type 


* a category area is converted to a set element if that element 
exists with the same ordinal number 


* a string area is converted to text 


Category data pasted into: 


* an integer, real, or long area is converted to an ordinal value 


* a category area is converted to a set element if that element 
exists with the same ordinal number 


* a string area is converted to text 


String data pasted into: 


* an integer, real, or long area is converted to the area type if it 
contains a number; otherwise, a missing value is entered 


* a category area is converted to a set element if the string name 
is the same 


* a string area is not converted 
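The conversion rules above can be collected into one Python function. This is a sketch under several assumptions rather than the program's behaviour: the name convert is hypothetical, category values are represented by their element names, a value with no matching set element becomes a missing value (None here), a string "contains a number" only when the whole string parses as one, and real values are truncated when pasted into integer or long areas.

```python
NUMERIC = {"integer", "long", "real"}
MISSING = None  # stand-in for a missing value


def convert(value, src_type, dst_type, src_set=None, dst_set=None):
    """Convert one pasted value from src_type to dst_type.

    src_set and dst_set are the category sets (lists of element names)
    of the source and destination columns, where relevant.
    """
    if src_type == "category":
        ordinal = src_set.index(value) + 1
        if dst_type in NUMERIC:
            return float(ordinal) if dst_type == "real" else ordinal
        if dst_type == "category":
            # Element with the same ordinal number, if one exists.
            return dst_set[ordinal - 1] if ordinal <= len(dst_set) else MISSING
        return value  # string area: the element name as text

    if src_type in NUMERIC:
        if dst_type in NUMERIC:
            return float(value) if dst_type == "real" else int(value)
        if dst_type == "category":
            ordinal = int(value)
            if 1 <= ordinal <= len(dst_set):
                return dst_set[ordinal - 1]
            return MISSING
        return str(value)  # string area: the number as text

    # src_type == "string"
    if dst_type in NUMERIC:
        try:
            number = float(value)
            return number if dst_type == "real" else int(number)
        except (ValueError, OverflowError):
            return MISSING
    if dst_type == "category":
        return value if value in dst_set else MISSING
    return value  # string to string: not converted


DOSAGE = ["Low", "Medium", "High"]
print(convert("Medium", "category", "integer", src_set=DOSAGE))
print(convert(2, "integer", "category", dst_set=DOSAGE))
print(convert("12.5", "string", "real"))
print(convert("abc", "string", "integer"))
```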


Row(s) of Data Pasted Into the Input Row 


In this case, the data is pasted into highlighted cells in the input row. 
There must be as many cells highlighted in the input row as there are 
columns in the data being pasted into the new dataset. If there are 
more columns being pasted than there are cells highlighted in the 
input row, the columns that do not fit in the highlighted columns of 
the dataset are not pasted. The dataset grows by the number of rows 
pasted into it. 


For example: 


*> Activate the window Untitled-2. 


*> Click on the row number 1, which is to the left of cell one in 
the Treatment column, and drag down over row number 2 (or to 
the last row number if more than 2 rows have been created). All 
of rows 1 and 2 become highlighted. 


*> Choose Cut from the Edit menu and all rows disappear. 


*> Choose Lipid Data from the Window menu. It becomes the 
active window. (If you have closed Lipid Data, it does not appear 
on the Window menu. Use the Open selection on the File menu 
to open it again.) 


*> Highlight four rows of data in the columns Age, Weight, and 
Cholesterol. 


[Figure: Lipid Data with four rows highlighted in the columns Age, 
Weight, and Cholesterol.] 


*> Choose Copy from the Edit menu. 


*> Choose Untitled-2 from the Window menu, and it becomes 
active. 


*> Highlight the cells in the input row under the columns Age, 
Weight, and Result. 


*> Choose Paste from the Edit menu and the 12 data points (in four 
rows) are entered into the new dataset. 


Column(s) of Data Pasted to Existing Dataset Cells 


This section describes columns of data, rows of data, or a block of 
contiguous cells crossing multiple columns and rows of data being 
pasted to existing dataset cells. In this case, the data is pasted into a 
selected (highlighted) area of the existing dataset. 


For example: 


*> Highlight four rows of data in the columns Age, Weight, and 
Cholesterol in the dataset Lipid Data. 


*> Choose Copy from the Edit menu. 


*> Highlight a four row by three column section of the dataset other 
than the one you selected when you copied the data currently in 
the Clipboard. 


*> Choose Paste from the Edit menu, and the data in the Clipboard 
replaces the data in the highlighted cells. 


*> Choose Undo from the Edit menu, and the paste is reversed. 


Pasting Transposed Data 


StatView II allows data to be transposed in the Clipboard before 
pasting. To see how this works: 
*> Enter the following values in a 3 x 3 section of Untitled-2 (be 
sure you are not trying to enter the data into string or category 
columns). 


1 2 3 
4 5 6 
7 8 9 


*> Select the cells and choose Copy from the Edit menu. 


*> Select the 3 x 3 block of cells below the block you copied (or 
three cells in the input row below the block you copied) and 
choose Paste Transposed from the Edit menu. The following 
data fills the block: 


1 4 7 
2 5 8 
3 6 9 


Pasting transposed is useful for changing entire rows into columns 
and entire columns into rows. 
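Transposition swaps rows and columns, so the value in row i, column j ends up in row j, column i. A minimal Python sketch (the function name transpose is chosen for illustration and is not a StatView II command):

```python
def transpose(block):
    """Return a copy of `block` with its rows and columns exchanged."""
    return [list(column) for column in zip(*block)]


copied = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
pasted = transpose(copied)

for row in pasted:
    print(*row)
```

The same function turns a single row into a single column, which is the case described in the sentence above.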


Pasting to Other Applications 


Datasets, views, and results can be pasted to other Macintosh 
applications. Datasets can be pasted to spreadsheets and views can 
be pasted to any application that accepts a MacPaint or MacDraw 
picture.- Analysis results can be pasted as text to other applications. 


If you are pasting a fully zoomed view into a word processor, make
sure the margins in the word processor document are set wide 
enough to accept the paste without compressing it horizontally. 


Cutting, Clearing, and Deleting Data 


The Edit menu contains the choices Cut, Clear, and Delete. All 
these act on selected data in a dataset. 


Cut 


Cut removes selected data and places it in the Clipboard (where it is
available for pasting). If the data does not constitute an entire row or
column, the row number or column name remains in the dataset, and
cells from which data has been cut contain missing value symbols. If
an entire row or column is selected (including row number and
column title), the row or column is removed completely from the
dataset. Rows below a cut row move up, and columns to the right of
a cut column move to the left.


Clear 


Clear removes highlighted data and replaces it with missing value
symbols. The data is not placed in the Clipboard, and is permanently
lost unless the Undo command (on the Edit menu) is chosen
immediately or unless changes are not saved when the file is closed.


Clear does not delete a row number or column head even if they are
highlighted when the command is selected.


Delete


Delete operates only on entirely selected rows or columns. The row
or column that is deleted is removed completely from the dataset.
Rows below a deleted row move up, and columns to the right of a
deleted column move to the left.


The data is not placed in the Clipboard, and is permanently lost
unless an Undo command (on the Edit menu) is chosen immediately
or unless changes are not saved when the file is closed.


If the Clipboard is the active window, the menu choice Delete
changes to Delete Clipboard. This allows you to release the
memory used by unwanted data in the Clipboard.


Importing Data in Text Files


StatView II can import text files produced by other computers or
other Macintosh applications. If the files are coming from a 
computer other than a Macintosh, the files must be saved on a 
Macintosh disk using a terminal emulation program (such as 
MacTerminal or MicroPhone), or be accessible over a local area 
network (like AppleTalk). Text files can also be entered through a 
variety of text editors (MDS Edit, MPW, QUED) and word 
processors (MacWrite, Word, etc.). These text editors are frequently 
useful for massaging text files that contain formatting problems 
which prevent StatView II from reading them. 


Text File Format 


StatView II uses an intelligent, two pass import algorithm. It 
assumes that the text file is organized in a row by column format 
with each row occupying a line, and each data point in a line 
separated by one or more separator characters. The default format 
which StatView II expects is that each data point in the line is 
separated by a tab and that each line is delimited by a carriage return. 
Lines may be delimited by a carriage return, line feed, or the two in
succession. Empty lines, that is, lines containing only a carriage
return, will be discarded by StatView II.


Data points in the text file can be numbers or strings (alpha-numeric
characters), and may be enclosed in quotation marks ("). Data points
may be separated by tab characters, spaces, commas, carriage
returns, any user-specified single character, or a combination of
these. A common problem is encountered when importing data from a
text file that contains several separator characters in a row. If the
duplicated separator character is a space, then StatView II will
compress it to a single space. However, if the separator character is
anything else (such as a tab or comma), StatView II will insert a
missing value in the imported dataset between the successive
separator characters. This conforms to the standards used by many
Macintosh spreadsheets and databases, which write out text files
with missing fields represented as two successive separator
characters.
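
The difference between repeated spaces and repeated tabs or commas can be sketched as below. This is an illustration of the rule just described, not StatView II's import code; the function name `split_line` and the use of `None` to stand for a missing value are assumptions, and quotation marks are not handled.

```python
import re


def split_line(line, separator="\t"):
    """Split one line of a text file into data points.

    Runs of spaces collapse into one separator. Any other repeated
    separator yields a missing value (None) between the repeats.
    """
    if separator == " ":
        return re.split(r" +", line.strip(" "))
    return [item if item != "" else None for item in line.split(separator)]


print(split_line("1  2   3", separator=" "))   # three data points
print(split_line("1\t\t3"))                    # missing value in the middle
```

A line such as `1,,3` read with a comma separator behaves like the tab case: the second data point is missing, and the third stays in the third column.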


If you import a text file using something other than a space as a
separator character, and it appears that data points in certain rows are
shifted to the right, check the text file to see if it contains unwanted
groups of the separator character. Conversely, if you import a text
file with spaces as the separator character and see that data points on
some rows are shifted to the left, check the text file to make sure that
every missing value is explicitly specified as a period (.) or bullet (•)
and not as a succession of two spaces.


StatView II looks for carriage returns to delimit a line of text. If a file 
uses carriage returns to separate data points, StatView II cannot
determine the number of columns in the text file. In this case you are 
required to enter the number of columns present in the file. Also note 
that in this case, duplicate carriage returns represent a missing data 
point; they are not compressed into a single missing value. 
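
Why the column count is needed can be seen in a short sketch. The code below is an illustration under stated assumptions, not StatView II's algorithm: the function name `rows_from_return_separated` is hypothetical, and `None` stands for a missing value.

```python
def rows_from_return_separated(text, n_columns):
    """Group return-separated data points into rows of n_columns each."""
    # Treat carriage return, line feed, or the two in succession alike.
    items = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    if items and items[-1] == "":
        items.pop()  # the final return ends the last data point
    # An empty item comes from two returns in a row: a missing data point.
    items = [item if item != "" else None for item in items]
    return [items[i:i + n_columns] for i in range(0, len(items), n_columns)]


rows = rows_from_return_separated("1\r2\r\r4\r5\r6\r", 3)
print(rows)
```

Without `n_columns` the six data points could equally be one row of six, two rows of three, or three rows of two, which is why the dialog asks for the number.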


Column names may appear as a string in the first line of a text file 
(see below for criteria for column names). However, a text file is not 
required to have column names; StatView II checks the file to see if
they are present. If the program determines that column names are 
not included in the text file, it will create default column names. You 
may change these using the Format choice in the Tools menu. 


You have the option of changing any column of small integers in a 
text file to a category data column. This is useful if the text file was 
created in a statistical program that uses integers to categorize data. 
If this option is selected, any integer column containing fewer than 64
elements whose lowest value is between 0 and 6 is converted to a 
generic category during the import. You may choose Edit Category 
from the Tools menu to customize this category. 


The Import Function 
To import a text file: 


*> Choose Import from the File menu, and the standard Macintosh
file selection dialog appears.


*> Choose the file you wish to import and click Import. 


The Import dialog box appears, allowing you to specify
characteristics of the text file. The default settings assume a text file
with a separator character of a tab. Check boxes let you specify the
separator character as either a space, comma, return, a user-entered
character, or a combination of these. If carriage return is selected as a
separator character, the number of data columns in the text file must
be entered into the Number of data columns field.


[Import dialog box: "Please specify how this text file looks. Items
may be separated with tabs and:" followed by check boxes for spaces,
commas, returns, and a user-entered character; a Number of data
columns field; and a Convert small integers to Categories check box.]


Be careful when specifying a separator character. If the text file 
contains un-quoted strings or category data points, and any of those
data points contain the potential separating character, your file will 
be imported incorrectly. 





Criteria for Choosing Column Names 


The import algorithm of StatView II always attempts to interpret the 
first line (or first n data points for a carriage return delimited file) as 
a row of column titles. Only if the first line fails one of StatView II's 
tests is it included as a row of data. StatView II accepts the first line 
as titles if: 


* All items on the first line are non-numeric. This means that 
StatView II will not give a column a name such as “1986”, 
“1987”, etc. even if they do appear on the first line. A column 
name may, however, begin with a number. Names such as
“2nd measurement” would be accepted. Enclosing a number
in quotes does not change how StatView II looks at it; the quotes
are simply discarded.


* In the unlikely case that all data points in the file are 
categories or strings, StatView II will check to see if the item 
in the first line of a category column is unique. If it is, 
StatView II will assume it is the column name. If this item 
appears in a later line in its column, StatView II then assumes 
that the entire column from first line to last contains all 
category items. If any of the category columns has a unique
item in the first line, the entire line is considered to be
column titles no matter whether other category columns pass
the test.


If these two tests are not met, then the columns are given default
names (Column 1, Column 2, Column 3, and so on) and the first line
is treated as data points.


StatView II does not require that the first line has the same number
of items as the rest of the lines in the text file. In the course of 
analyzing the text file, if StatView II determines that it needs another 
column above and beyond what the title line says, it simply adds it to 
the right of the data set and gives it a default name. 
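
The title tests and default naming described in this section can be sketched as follows. This is one reading of the rules, not StatView II's actual code: the names `is_numeric` and `split_titles` are hypothetical, the input is assumed to be a non-empty list of rows already split into strings, and the second test is applied only when every data point in the file is non-numeric.

```python
def is_numeric(item):
    """True if the item parses as a number once surrounding quotes are dropped."""
    try:
        float(item.strip().strip('"'))
        return True
    except ValueError:
        return False


def split_titles(rows):
    """Return (column_names, data_rows) for a list of rows of strings."""
    width = max(len(row) for row in rows)
    first, rest = rows[0], rows[1:]
    # Test 1: every item on the first line must be non-numeric.
    has_titles = all(not is_numeric(item) for item in first)
    all_text = all(not is_numeric(item) for row in rest for item in row)
    if has_titles and all_text:
        # Test 2: at least one first-line item never reappears in its column.
        has_titles = any(
            all(col >= len(row) or row[col] != item for row in rest)
            for col, item in enumerate(first)
        )
    if has_titles:
        names, data = [item.strip('"') for item in first], rest
    else:
        names, data = [], rows
    # Columns without a title receive default names.
    names += [f"Column {i + 1}" for i in range(len(names), width)]
    return names, data


names, data = split_titles([["Year", "2nd measurement"],
                            ["1986", "3.5"],
                            ["1987", "4.1", "7"]])
print(names)
```

In the usage example the third column appears only in a data line, so it is added to the right of the titled columns with a default name.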


Criteria for Choosing Column Types 


In the first pass through the text file, StatView II checks two things:
the presence or absence of column titles on the first line, and the
types of the columns. All columns are assumed to contain the
simplest data type, integer, until StatView II reads a data point that is
not an integer. For example, if StatView II reads a column in a text
file whose first three lines hold integers and whose fourth line holds
a real number, it processes the data up to the fourth line as a column
of integers. Upon encountering the real number in line four, StatView II
changes its assessment and reads and stores the column as a real number
column.
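
The promotion from integer to real can be sketched in a few lines. This is an illustration of the idea only; the function name `numeric_column_type` and the sample values are assumptions, and the sketch expects every data point to be numeric.

```python
def numeric_column_type(items):
    """Start from the simplest type, integer, and promote to real when needed."""
    column_type = "integer"
    for item in items:
        try:
            int(item)
        except ValueError:
            float(item)  # raises ValueError if the item is not a number at all
            column_type = "real"
    return column_type


print(numeric_column_type(["1", "2", "3"]))         # integer
print(numeric_column_type(["1", "2", "3", "4.5"]))  # real
```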


The case for reading alpha-numeric data is a bit more complex. First,
where a column contains only alpha-numerics, StatView II keeps
track of each alpha-numeric data point it sees and counts how many
times it is encountered. If, after reading an entire column, StatView
II finds that each alpha-numeric appeared only once, it will define
the column as string type. If there are repeated alpha-numerics but
there are more than 64 distinct alpha-numerics in a column, the
column will also be of string type. Only if there are repeated
alpha-numerics and there are 64 or fewer unique strings will a column be
declared as type category. Elements in the category are placed in the
order in which StatView II encounters them.
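
The string versus category decision can be sketched as below. The names `classify_text_column` and `MAX_CATEGORY_ELEMENTS` are hypothetical and the code is only an illustration of the counting rule, not StatView II's implementation.

```python
from collections import Counter

MAX_CATEGORY_ELEMENTS = 64  # limit stated in the text


def classify_text_column(items):
    """Return (column_type, category_elements) for a column of strings."""
    counts = Counter(items)  # keys keep the order in which they were first seen
    repeated = any(count > 1 for count in counts.values())
    if repeated and len(counts) <= MAX_CATEGORY_ELEMENTS:
        return "category", list(counts)
    return "string", []


print(classify_text_column(["red", "blue", "red", "green"]))
print(classify_text_column(["Parker", "Davis", "Coltrane"]))
```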


Finally, if a column in a text file contains both numeric and
alpha-numeric data, StatView II looks at which occurs more frequently. If
a column has ten integers and four strings, it is read as an integer
column. The lines containing strings appear as missing values. If a
column has four integers and ten strings, it will be read as
alpha-numeric data and will undergo the evaluation outlined for
alpha-numeric columns in the previous paragraph. Note that numbers
beginning with $ (such as currency) or with commas separating
values (10,001) will be treated as a string. If a column is determined
to be a string column, then the integers will simply appear in their
proper cells as strings. However, if the column is of type category,
then an integer, i, will be read as the ith element of the category.
Reals appearing in columns with a majority of alpha-numerics will
be handled in a similar manner, except that a real will never be
converted to a category element.
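
The majority rule for mixed columns can be sketched as follows. This is an illustration under several assumptions that the text does not settle: a tie between numbers and strings is treated as a string majority, a real in a category column becomes a missing value, and `None` stands for a missing value. The function names are hypothetical.

```python
from collections import Counter

MAX_CATEGORY_ELEMENTS = 64  # limit stated in the text


def parse_number(text):
    """Return an int or float for numeric text, or None for anything else."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return None  # includes "$1" and "10,001", which are not plain numbers


def read_mixed_column(items):
    """Return (column_type, values) for a column mixing numbers and text."""
    numbers = [parse_number(item) for item in items]
    n_numeric = sum(value is not None for value in numbers)
    if n_numeric > len(items) - n_numeric:
        # Numbers dominate: text items become missing values.
        is_real = any(isinstance(value, float) for value in numbers)
        return ("real" if is_real else "integer"), numbers
    # Text dominates: apply the string/category rule to the text items.
    counts = Counter(item for item, value in zip(items, numbers) if value is None)
    repeated = any(count > 1 for count in counts.values())
    if not (repeated and len(counts) <= MAX_CATEGORY_ELEMENTS):
        return "string", list(items)  # numbers stay in place as strings
    elements = list(counts)  # first-seen order
    values = []
    for item, value in zip(items, numbers):
        if value is None:
            values.append(item)
        elif isinstance(value, int) and 1 <= value <= len(elements):
            values.append(elements[value - 1])  # integer i -> ith element
        else:
            values.append(None)  # assumed: reals and out-of-range integers
    return "category", values


print(read_mixed_column(["red", "blue", "red", "green", "1", "5"]))
```

In the usage line the integer 1 is read as the first element (red), while 5 has no matching element among the three and becomes a missing value.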


Examples of Importing Data 


This is a sample text file, Import Example, included in your 
Sample Data folder. The text file, when displayed in Microsoft 
Word, appears as follows: 


[Screen image: the Import Example text file displayed in Microsoft
Word as ten lines of tab-separated data. Legible values include a
column of small integers (1, 2, 3, ...), dates from 12/1/87 through
12/10/87, currency values ($1, $10,001, $20,001, ... $80,001), a
numeric column beginning 1, 2, 3, 4, 5.1, and a final column of strings
(Charlie, Parker, Miles, Davis, John, Coltrane, Roscoe, Mitchell, 23.7).]


Notice that the first column contains all integers with values between 
1 and 3. The second column contains dates. The third column 
contains currency values. The fourth column has four integer values 
and six real values. The fifth column is missing three data values. 
The sixth and seventh columns contain strings interspersed with
numbers.





Import this text file into StatView II.


*> Choose Import from the File menu, and the standard Macintosh
file selection dialog appears.


*> Select Import Example from the Sample Data folder and click
Import. The import dialog box appears. Keep the default
settings.


*> Click OK. A data window named Untitled-1 will appear.


The first column is imported as an integer column because all its
values are integers. The second and third columns are imported as
string columns because they contain alphanumeric characters (the slash
in the dates, and the dollar signs and commas in the currency fields). The
fourth column is imported as a real column because the column
contains at least one real data point. The fifth column is imported as
an integer column with three missing values taking the place of the
three data points with no values. The sixth column is imported as a
category column because it contains four repeated strings. The
number 1 has been interpreted as the first category element, which is
red. The number 10 has been converted to a missing value because
there are only 4 category elements. The last column is imported as a
string. The value 23.7 is merely another string in the column.


An easy way to determine the data type assigned to each column is 
to choose Format from the Tools menu. A dialog box then displays 
the type of all selected columns in the data window. 


Import the data file again. This time we will change a parameter of
the import.


*> Choose Import from the File menu, and the standard Macintosh
file selection dialog appears.


*> Select Import Example from the Sample Data folder and click
Import. The import dialog box appears.


*> Select Convert small integers to Categories. Leave other
defaults as they are, and click OK.


Note that this time the first column is defined by a generic category 
set. This column was produced by the conversion of a small integer 
column to a Category column. You may use the Edit Categories 
selection on the Tools menu to change the element names in this set 
to ones that are more descriptive. 


Importing Data from Microsoft Excel 


While in Excel, choose the Save As command from the File menu. 
Click Text as the format for saving the document. Click Save to save 
your spreadsheet as a Text file. 


Excel saves as text the values as they appear in the cells of the
worksheet. If you have cells that are formatted as dates
(12/07/59), currency ($1.25), or use a comma to group digits
(10,001), StatView II will read these in as strings. If you wish these
values to be read as numerical values, you must change their format.
Excel separates its rows with carriage returns and columns with tabs.
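
If the formats cannot be changed in the spreadsheet, a text file that has already been saved could in principle be cleaned before import. The sketch below is a hypothetical preprocessing step, not a feature of StatView II or Excel; the function names are assumptions, and it is only safe for tab-separated files, since removing commas would damage a comma-separated one.

```python
def plain_number(text):
    """Strip a leading dollar sign and digit-grouping commas from a field.

    Fields that are still not numbers afterwards (dates, names) are
    returned unchanged.
    """
    cleaned = text.strip().lstrip("$").replace(",", "")
    try:
        float(cleaned)
    except ValueError:
        return text
    return cleaned


def clean_line(line):
    """Apply plain_number to every tab-separated field on one line."""
    return "\t".join(plain_number(field) for field in line.split("\t"))


print(clean_line("12/1/87\t$10,001\tred"))
```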


Importing Data from Acius 4th DIMENSION 


While in 4th DIMENSION, choose the Export Data command from 
the File menu. Click Text as the format for saving the document. 
Click Save to save your data as a text file. 4th DIMENSION exports 
its data with tabs separating each field and carriage returns 
separating each record. 


Importing Data from Blyth Software's Omnis 3 Plus 


While in Omnis 3 Plus, choose Delimited (tabs) as the format from
the Export data window. Click Start to save your data as a text file.
Omnis 3 Plus will then export its data with tabs separating each field
and carriage returns separating each record.


Altering Datasets


Removing rows and columns from datasets was discussed earlier in
this chapter. 


Changing names of existing columns, the data type of columns, and
the number of decimal places displayed in a real column are discussed
here. Adding columns to datasets is also discussed here. Columns
may be added at the right side of the dataset or between existing 
columns. The new column created has as many rows as the dataset. 
Each cell in the new column has a missing value symbol (a period). 


Adding and Inserting Columns 


Columns are added to datasets using the New Column choice in the 
Tools menu. (The selection New from the File menu creates a new
dataset.)


For example: 


*> Choose Lipid Data from the Window menu (if it is not already 
the active window), and it becomes active. 


*> Choose New Column from the Tools menu and the New Data 
Column Information dialog box discussed in the section Creating 
Datasets in this chapter appears. 


When this dialog box is prompted by the Tools menu, it creates 
columns that are added to the right side of the active dataset. 


*> Click Cancel All and the New Data Column Information box
closes.


*> Position the cursor between the column names Weight and
Cholesterol and on the vertical line that separates the columns.
The cursor becomes a cross with an arrow on its vertical bar. (As
mentioned before, this is the cursor that can be used to expand or
contract column widths by clicking and dragging.)


*> Hold down the Command key (the cursor changes to a thinner
cross) and click. The New Data Column Information dialog box
appears again. Now, however, the new column being defined
will be entered in the data set between the columns Weight and
Cholesterol.


*> Add a column or Cancel, as you wish.


Changing Column Name and Decimal Places


*> Activate Lipid Data.


*> Select the columns Gender, Age, Weight, and Cholesterol.


*> Choose Format from the Tools menu, and the following dialog
box appears:


[Dialog box: Column Information, showing the Name field, Type:
Category, and a scrolling list of the category elements male and female.]


The column name displayed in this box can be changed by 
clicking on it and editing. Since Gender is a category column, the 
elements of the category set assigned to that column are shown in 
the central scrolling box. 





*> Click Next.
The next highlighted column, Age, appears in the box.

*> Change Age to Age - Months and click Next.
The next highlighted column, Weight, appears in the box.

*> Click Next.
Notice that the Next button has changed to Done since
Cholesterol is the last of the four columns that were highlighted
before Format was selected.


*> Click Done, and the box disappears.


The column title Age is changed to Age - Months. 


Changing Column Data Type 


The data type of a column (real, integer, long integer, category, or 
string) can only be changed by pasting the data from the column into 
another (newly created) column with the desired data type. The 
method for converting one type of data to another is presented in the 
section above on pasting data. 


Assigning Variables


Variables (denoted as Xi and Yi) are assigned to dataset columns to
direct StatView II to include those columns in an analysis. The ease
with which variable assignments may be changed, and the speed
with which analyses are recomputed after these changes, constitute
one of StatView II's strongest interactive features.

Variable assignments signify different pairings of columns in
different analyses. Information about this facet of variable
assignment is covered in the following section: The Mechanics of
Running Analyses.


This section discusses the four methods for making variable 
assignments in datasets. 


* Choosing Choose X or Choose Y from the Variables menu.
* Double-clicking in the dataset.
* Command-clicking in the dataset.
* Choosing Quick Assignment from the Variables menu.


Selecting Columns to Choose Variables 

The Choose X, Choose Y, and Clear X&Y selections in the 
Variables menu operate on any completely selected (highlighted) 
dataset column. 

*> Activate Lipid Data if it is not already active. 


*> Choose Age, Weight, and Cholesterol by clicking on the title 
Age and dragging across the three titles. 


*> Select Choose X from the Variables menu, and the three
columns are assigned X1, X2, and X3 respectively.


*> Select Choose Y from the Variables menu, and the three 
columns are assigned Y1, Y2, and Y3 respectively. 


*> Select Clear X&Y, and the three columns are cleared of 
assignments. 


The Select All Columns choice in the Edit menu is helpful to use
before Choose X or Clear X&Y if you wish to make or clear
assignments through the entire window.


Double-Clicking Columns to Choose Variables


Double-clicking on a column title causes that column to be assigned 
as an X variable if it currently is something else (a Y variable or 
unassigned), and causes it to be cleared of assignment if it currently 
is an X variable. 


Holding down the Option key and double-clicking on a column title 
causes that column to be assigned as a Y variable if it currently is
something else (an X variable or unassigned), and causes it to be 
cleared of assignment if it currently is a Y variable. 


Command-Clicking Columns to Choose Variables 


Holding down the Command key and clicking once on a column title 
causes that column to be assigned as an X variable if it currently is
something else (a Y variable or unassigned), and causes it to be 
cleared of assignment if it currently is an X variable. 


Holding down the Option and Command keys and clicking once on a 
column title causes that column to be assigned as a Y variable if it 
currently is something else (an X variable or unassigned), and causes 
it to be cleared of assignment if it currently is a Y variable. 
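
Both the double-click and the Command-click shortcuts follow the same toggle rule, which can be sketched as below. The function name `toggle_assignment` and the use of `None` for an unassigned column are assumptions made for the illustration.

```python
def toggle_assignment(current, option_held=False):
    """Return a column's new assignment after a double-click or Command-click.

    current is "X", "Y", or None (unassigned). Holding Option targets Y
    instead of X. Clicking a column that already has the target
    assignment clears it.
    """
    target = "Y" if option_held else "X"
    return None if current == target else target


print(toggle_assignment(None))                    # unassigned -> X
print(toggle_assignment("X"))                     # X -> cleared
print(toggle_assignment("X", option_held=True))   # X -> Y
```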


Quick Assignment of Variables 


*> Choose Quick Assignment from the Variables menu, and the 
following dialog box appears: 


[Dialog box: Quick Assignment for Lipid Data, showing an Unassigned
Variables list (Gender, Age, Weight, Cholesterol, Triglycerides, HDL,
LDL, and further columns), an X Variables list, a Y Variables list, and
the buttons Clear X&Y, Choose X, and Choose Y.]





The Unassigned Variables scrolling list contains all the columns in 
the active dataset that are unassigned except string columns (which 
cannot be assigned). 


The X and Y Variables scrolling lists contain those columns in the 
data set that are currently assigned as variables. 


Column titles can be highlighted either individually or in groups in
any of the three lists and transferred to either of the other two lists by
clicking the buttons:


* Clear X&Y: places highlighted columns in the X or Y
Variables list into the Unassigned Variables list,


* Choose X: places highlighted columns in the Y Variables or
Unassigned Variables list into the X Variables list, or


* Choose Y: places highlighted columns in the X Variables or
the Unassigned Variables list into the Y Variables list.


Selections in the scrolling lists may be made using the standard 
Macintosh Shift-Click and Command-Click combinations to select 
multiple contiguous columns (Shift) or multiple non-contiguous 
columns (Command). 


Double-Clicking in the Quick Assignment Window


Double-clicking in the Quick Assignment dialog box assigns
variables as well.


* Double-clicking on a column title in the Unassigned
Variables list places that column in the X Variables list.


* Holding down the Option key and double-clicking on a 
column title in the Unassigned Variables list places that 
column in the Y Variables list. 


* Double-clicking on a column in either the X or the Y 
Variables list places that column in the Unassigned Variables 
list. 


* Holding down the Option key and double-clicking on an 
assigned column title places that column in the opposite 
variable assignment list. 


Variable Subscripting 


Each X and Y variable has a subscript. Subscripts are added 
consecutively. If several columns are being assigned at the same 
time, they are subscripted consecutively from left to right in a 
dataset. 


When an X or a Y assignment is removed from a column, all 
variable subscripts after that column move up one in the queue. 
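
The subscript behavior can be pictured as positions in an ordered list, as in the sketch below. This is an illustration only; the function names and the list representation are assumptions, not a description of StatView II's internals.

```python
def subscripts(assigned):
    """Map each assigned column to its subscript (1, 2, 3, ...)."""
    return {column: position for position, column in enumerate(assigned, start=1)}


def remove_assignment(assigned, column):
    """Return the assignment list with one column removed; later subscripts shift."""
    return [name for name in assigned if name != column]


x_variables = ["Age", "Weight", "Cholesterol"]
print(subscripts(x_variables))

x_variables = remove_assignment(x_variables, "Weight")
print(subscripts(x_variables))
```

After the assignment is removed from Weight, Cholesterol moves from subscript 3 to subscript 2.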


Subscripts affect how variables are graphed and referenced in
analyses. The following sections (Graphing Data and The Mechanics
of Running Analyses) detail this interaction.


Graphing Data


StatView II is a powerful graphing package. Scattergrams, line
charts, bar charts, box plots, pie charts, histograms, error bars, and 
univariate scattergrams can be viewed and printed. Graphs can 
represent as many points or variables as can be entered in a dataset. 


All views can be customized using controls that appear on the left 
side of the view window. Refer to Chapter 3 for a comprehensive 
view of graphing with StatView II. 


Selecting Variables for Graphs 


Variables can be graphed two different ways with StatView II: either 
as individual X variables or as X-Y variable pairs. These graphs can 
either be paging (with one set of data values displayed on the graph) 
or composite (with several sets of data values superimposed on the 
graph) and can reflect data with or without statistics. 


Individual X Variable 


Individual X variables are displayed on graphs with the vertical axis 
representing the data values and the horizontal axis indicating their 
row position in the dataset or columns. 


X-Y Variable Pairs 


Y variables can be graphed against X variables. If more than one X
and more than one Y variable is assigned, X1 is paired with Y1, X2
with Y2, etc. If three X variables and four Y variables are assigned,
the Y4 variable (with no matching X variable) is not graphed.


If there is a single X variable and several Y variables assigned, each
Y variable is plotted against the X variable. A similar graph is
produced if there is a single Y variable and multiple X variables.


X against Y graphs can be viewed in either composite or paging 
form. In paging form, each page presents a variable pair. 


Selecting Views 


Views are selected from the View menu (None; Table;
Scattergram; Bar Chart; Pie Chart; Line Chart; and Box Plot).
The selected view is marked with a check on the View menu.
Deselect a view by selecting a new view or selecting the choice
None.


One view window can be open for each dataset open. An active view
window redraws graphs anytime:


* Variable selection is changed in its associated dataset
* Data in the graphed X and Y columns is changed
* Rows are excluded in the dataset (details on exclusion below)
* Range restrictions are changed in its associated dataset
* A new statistic is selected


Zooming Views


Active view windows (except table views) can be zoomed to full
screen size three different ways:


* Select Zoom Up from the Window menu (Command-W)
* Double-click on the title bar
* Click in the zoom box


It is optimal to work with all the windows zoomed to full screen size.
Variable assignments can be changed with the view window at full
size using the Quick Assignment choice in the Variables menu. If it
becomes necessary to view the dataset, double-clicking in the view
window anywhere except the control panel activates the dataset
associated with that view. The view window can be activated from
behind a full sized dataset by double-clicking in the dataset's close
box.


The Window menu also provides window management functions.
All open windows are listed on the menu. The active window is
checked. Any window can be activated by selecting it from this
menu.


Since a visible view redraws data graphs (and analysis results)
whenever changes are made to the data (or the analysis selection), if
you are changing both the analysis and the data analyzed at the same
time, zooming the dataset to full screen size saves waiting for
redrawing after the first change has been made.


Mechanics of Running Analyses


This section is a general overview of StatView II's statistical
operations. Information on the specific statistics is given in Chapters 
4 and 5. 

The procedure for running an analysis is straightforward: 


* Choose a statistic from the Describe or Compare menu
* Assign an X or Y to the variables which you wish to analyze
* Select a view from the View menu.


Statistics are chosen from the Describe (descriptive statistics) and 
the Compare (advanced, comparative statistics) menus. 


A chosen statistic is marked with a check by its name on the menu. 
Deselect statistics by choosing None from the top of the menu from 
which they are selected. 


Choosing any statistic from the Compare menu or Frequency 
Distribution from the Describe menu deselects all previously 
selected statistics (from either menu). Multiple statistics may be 
selected from above the gray line on the Describe menu. 


Variables Selection and the Analysis 

X and Y variable assignments direct the chosen statistic to the data to 
be analyzed. StatView II can compute the chosen statistic on a 
sequence of variables producing multiple analysis results. The 
subscripts of the variables determine how variables are used when 
multiple results are computed. 

The discussion below classifies StatView II statistics according to
the number of X and Y variables they use to calculate results. These 
classifications arise from the nature of the statistics. For example, a 
multiple regression references one Y variable (dependent column) 
and one or more X variables (independent columns). You may, 
however, assign several Y variables thereby obtaining separate 
analyses for each Y variable. 

OneX Statistics 


These statistics utilize one X column per analysis. OneX statistics 
calculate a result for each X variable in the dataset. 


Statistics in this category are: 
* all the descriptive statistics, and
* the One Sample t-Test.


ManyX Statistics


These statistics utilize several X columns per analysis. ManyX 
statistics calculate one result using all the X variables in a dataset. 


Statistics in this category are: 
* Correlation Matrix,
* Factor Analysis,
* Single Factor Repeated Measures ANOVA,
* Contingency Table from Tabulated Data, and
* Friedman nonparametric test.


OneXOneY Statistics


These statistics utilize one X and one Y column per analysis.
OneXOneY statistics calculate one result for each X-Y pair in the
data set. Three possible variable pairings can occur.


If there is a single X variable and multiple Y variables assigned, then
multiple analyses are run using the X variable paired with each
individual Y variable. There are as many results as there are Y
variables.


If there is a single Y variable and multiple X variables assigned, then
multiple analyses are run using the Y variable paired with each
individual X variable. There are as many results as there are X
variables.


If there are several X variables and several Y variables assigned,
then multiple analyses are run pairing X1 with Y1, X2 with Y2, and
so on. There are as many results as there are pairs of X and Y
variables with matching subscripts.
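
The three pairing cases can be sketched as one small function. The name `pair_variables` is hypothetical and the code only illustrates the rule stated above; the lists are assumed to hold column names in subscript order.

```python
def pair_variables(x_columns, y_columns):
    """Return the (X, Y) pairs a OneXOneY statistic would analyze."""
    if len(x_columns) == 1:
        # One X is paired with every Y.
        return [(x_columns[0], y) for y in y_columns]
    if len(y_columns) == 1:
        # One Y is paired with every X.
        return [(x, y_columns[0]) for x in x_columns]
    # Otherwise pair by matching subscripts; zip stops at the shorter list,
    # so variables without a partner are left out.
    return list(zip(x_columns, y_columns))


print(pair_variables(["Gender"], ["Age", "Weight", "Cholesterol"]))
```

The usage line corresponds to one X variable and three Y variables, which yields three analyses.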


Statistics in this category are: 
* Compare Percentiles
* Paired and Unpaired t-Tests
* Correlation Coefficient
* Simple and Polynomial Regression
* Single Factor ANOVA*
* Chi-Square (One Group)
* Contingency Table from Coded Data
* All Two Group Nonparametrics
* Kruskal-Wallis Single Factor ANOVA*


* Both the Single Factor ANOVA and Kruskal-Wallis Single 
Factor ANOVA can reference only one X variable. 


ManyXOneY Statistics
These statistics utilize several X columns and one Y column per
analysis. ManyXOneY statistics calculate one result for each Y
variable assigned in the dataset. There must be at least one X
variable.


Statistics in this category are: 
* Multiple Regression
* Stepwise Regression**
* Two-or-more Factor ANOVA


** The Stepwise Regression can reference only one Y variable. 
Multiple results cannot be calculated. 


ManyX ManyY Statistics 

These statistics utilize several X columns and several Y columns per
analysis. ManyX ManyY statistics calculate just one result using the
X and Y variables.


The statistics in this category are:


* the Two-or-more Factor ANOVA models with repeated
measures.


Running Multiple Analyses 

It is necessary to keep the above described statistical classifications 
in mind when setting up multiple analyses through assignment of X 
and Y variables. This example shows how to compute multiple 
results using the Unpaired t-Test (a OneXOneY statistic). 

*> Activate Lipid Data if it is not already active.

*> Choose Revert to Saved from the File menu, and click OK.


This example computes three unpaired t-Tests comparing the
Ages, Weights, and Cholesterol values of males and females.


*> Choose Quick Assignment from the Variables menu. Assign
Gender as an X variable. Assign Age, Weight and Cholesterol
as Y variables.


*> Choose t-Test from the Compare menu. The t-Test dialog box
is presented.


*> Click Unpaired t-Test. Click OK.


*> Choose Table from the View menu, and the following table
appears:


[Table: View of Lipid Data. Unpaired t-Test X1: Gender Y1: Age.
Columns: DF, Unpaired t Value, Prob. (2-tail); for each group:
Count, Mean, Std. Dev., Std. Error. Legible row: female,
Mean 24.792, Std. Dev. 3.538, Std. Error .722.]


Each page in this paging table contains the results of an Unpaired
t-Test for an X-Y variable pair. Since three dependent Y
variables are specified, this paging table contains the results of
three analyses.


Turn the pages of the table to see the results for each test. The
second page shows:


[Table: View of Lipid Data. Unpaired t-Test X1: Gender Y2: Weight.]


The third page shows:


[Table: View of Lipid Data. Unpaired t-Test X1: Gender Y3: Cholesterol.]


The Dialog Boxes: General Information



Many of the statistics present dialog boxes after they are chosen. In 
these boxes, specifications for the analysis are entered. The boxes 
follow all standard Macintosh conventions: 


* Radio buttons are used when a group of exclusive choices is
available.


* Check boxes are used when cumulative choices are available.


* Regular buttons are used to cancel or activate (Done or OK)
the analysis using the parameters set.


* Pressing the Enter or Return key causes the outlined button to
be selected. This is usually the Done or the OK button that
closes the box, putting the parameters into effect.


* Text entry rectangles are presented when user-specified
numbers or names are required. Text can be entered and
edited in these rectangles by clicking in them and typing.
Double-clicking in them causes them to be highlighted, and
subsequent keyboard entries replace all highlighted type.


* Pressing the Tab key causes the cursor to circulate through
the text entry rectangles in the box.


Chapters 4 and 5 present a detailed look at each statistic’s dialog box.


Analysis on Data in Groups


Data is often classified into groups or categories. In Lipid Data the 
Gender column divides the cases into two groups: male and female. 
The columns Smoking History, Alcohol Use and Heart History also 
classify the patients into different groups. StatView II allows for easy 
classification of groups through its Category data type. 


Grouping Columns 


Several StatView II statistics, such as the Unpaired t-Test just 
presented, analyze grouped data. 


These statistics use an X column (or several X columns) as a 
grouping variable to partition selected Y columns. These X columns 
are referred to in the documentation as grouping columns. Grouping 
columns must be Integer or Category columns. 


If you are using Integer columns to code your data into groups,
StatView II assumes that the number of groups in the column is
equal to (maximum value in the column - minimum value in the
column) + 1, and that coding starts from the smallest value in the
column.
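
One consequence of this rule is that gaps in the integer codes create empty groups. The following Python sketch illustrates the arithmetic; the function names are hypothetical, and representing a missing cell as `None` is an assumption made for the example.

```python
def integer_group_count(codes):
    """Number of groups implied by an Integer grouping column: (max - min) + 1."""
    present = [c for c in codes if c is not None]  # skip missing cells
    return max(present) - min(present) + 1


def group_sizes(codes):
    """Count the cases in each implied group, starting from the smallest code."""
    present = [c for c in codes if c is not None]
    low = min(present)
    sizes = [0] * integer_group_count(codes)
    for c in present:
        sizes[c - low] += 1
    return sizes


codes = [1, 2, 5, 2, 1]              # codes 3 and 4 are never used
print(integer_group_count(codes))    # 5 groups are implied, not 3
print(group_sizes(codes))            # two of the five groups are empty
```

Coding groups with consecutive integers avoids the empty groups shown here.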


We encourage you, however, to use Category columns to group your
data. The Category data type is the most convenient way to specify
the groups of your data. With Category variables, you can enter
alphanumeric labels that name your groups unambiguously. As a result,
analysis results identify the groups by name.


Missing Values 


A missing value is entered by typing a “.” (period) or “•” (Option-8)
in a cell. The missing value is displayed by the “•” symbol.


Missing values are excluded from StatView II analyses. Statistics
which use grouped data exclude any missing values from the group.
The table outputs display the counts of all the non-missing elements
which were found in the group.


Statistics which use paired data (such as Regression, Factor
Analysis, and Correlation Coefficient) require that values be
non-missing (complete) across all rows (cases). Missing values cause the
entire case to be deleted. The table views for these analyses include
a message noting the number of cases which were deleted due to
missing values.


StatView II has a quick Recode function which allows, among other
things, for efficient recoding of missing values to column means,
geometric means, harmonic means, or a user-specified value. See
Chapter 6 on massaging data for details.


Stopping a Computation


StatView II displays the following rotating cursor during the
calculation of statistics:


[Figure: rotating cursor]


While this cursor is displayed, you can abort the calculation by
pressing the Command key and typing a period. This causes the
computation to halt and the current View window to close.


Computational Considerations


All calculations are performed in 80 bit extended arithmetic which 
ensures approximately 18 decimal places of accuracy. 


Sum of Squares Calculations 


Several StatView II statistics require calculation of the mean squared
deviation:


$$\sum (X - \bar{X})^2$$


StatView II uses an algorithm which provides more accurate results
for the mean squared deviation than the Monroe Calculator variance
formula:
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Viewing Analysis 
Results 


$$\sum X^2 - \frac{\left(\sum X\right)^2}{n}$$


StatView II uses the following algorithm for mean squared
deviation:


$$\sum (X - k)^2 - n\,(k - \bar{X})^2$$


where k is the first non-missing, non-excluded value for the variable,
and $\bar{X}$ is the calculated variable mean.


In addition, several statistics require that the mean deviation cross
product be calculated:


$$\sum (X - \bar{X})(Y - \bar{Y})$$


StatView II uses the following algorithm for the mean deviation
cross product:


$$\sum (X - a)(Y - b) - n\,(a - \bar{X})(b - \bar{Y})$$


where a is the first non-missing, non-excluded value for the X
variable, b is the first non-missing, non-excluded value for the Y
variable, $\bar{X}$ is the X variable mean, and $\bar{Y}$ is the Y variable mean.
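
The benefit of shifting by the first value can be seen in a short Python sketch. It is an illustration under stated assumptions, not StatView II's code: the function names are hypothetical, and Python uses 64-bit doubles rather than the 80-bit extended arithmetic described above, so the loss of accuracy in the calculator formula appears sooner here than it would in StatView II.

```python
def sum_sq_calculator(xs):
    """Calculator-style formula: sum(X^2) - (sum X)^2 / n."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n


def sum_sq_shifted(xs):
    """Shifted formula: sum((X - k)^2) - n * (k - mean)^2, k = first value."""
    n = len(xs)
    k = xs[0]
    mean = sum(xs) / n
    return sum((x - k) ** 2 for x in xs) - n * (k - mean) ** 2


def cross_product_shifted(xs, ys):
    """Shifted cross product: sum((X-a)(Y-b)) - n * (a - meanX) * (b - meanY)."""
    n = len(xs)
    a, b = xs[0], ys[0]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - a) * (y - b) for x, y in zip(xs, ys)) - n * (a - mean_x) * (b - mean_y)


# Large values with a small spread: deviations from the mean are -6, -3, 3, 6,
# so the exact sum of squared deviations is 36 + 9 + 9 + 36 = 90.
data = [1_000_000_004.0, 1_000_000_007.0, 1_000_000_013.0, 1_000_000_016.0]
print("shifted:   ", sum_sq_shifted(data))
print("calculator:", sum_sq_calculator(data))
```

The calculator formula subtracts two numbers near 4e18, where neighboring doubles are hundreds of units apart, so the small true answer is lost in rounding. The shifted formula only ever squares small differences.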


Matrix Inversions 


Several statistics require matrix inversions. StatView II uses the 
Sweep Operator procedure to invert matrices. 


The View menu provides access to graphic and tabular 
representations of analysis results in much the same manner that it 
provides graphs of raw data. After an analysis is chosen and 
necessary dialog box parameters defined, choosing a view from this 
menu opens a window on the desktop that presents results. A tools 
palette is included in all graphic views allowing customization of 
view features. Chapter 3 provides in depth information on 
customizing graphic views. 


The selected view is marked with a check on the View menu. 
Deselect a view by choosing a new view or choosing No View from 
the View menu. 


60 Chapter 2 — Learning StatView II 


Each open dataset may have one associated view window open. A
visible view window recalculates analyses anytime:


* data in a currently referenced X or Y variable is changed

* rows are included or excluded in the dataset (see below)

* range restrictions are changed in the dataset (see below)

* variable selection is changed in the dataset

* a new statistic is selected.


If you are making changes to the analysis and the data analyzed at 
the same time, zoom the dataset to full screen size. This saves 
waiting for recomputation after the first change has been made. 


Available Views by Analysis 


The View menu contains the following choices: No View, Table, 
Scattergram, Bar Chart, Pie Chart, Line Chart, and Box Plot. 
Not all the graphic views are available for all statistics. Views 
available for the current analysis are selectable; views not available 
are grayed out on the menu. 


The table below lists the views available for the descriptive analyses.


[Table: Descriptive Analyses Views]


The table below lists the views available for the comparative
analyses.


[Table: Comparative Analyses Views]


Copying from the View Window


You can copy an entire view (either table or graphic) in the 
Macintosh PICT format by choosing the Copy View command in 
the Edit menu. For table views, you may wish to copy just the 
actual result values to the clipboard for later pasting to a word 
processor or spreadsheet. To copy values and titles you may: 


* select individual values or titles by clicking on them


* select multiple values or titles by holding down the shift key
and selecting them


* select all values and titles by choosing Select All from the
Edit menu


Any selected values can be copied to the clipboard by choosing 
Copy from the Edit menu. 


More information on copying from graphic view windows is 
available in Chapter 3. 


Multiple Analysis Results: Paging vs. Composite 


StatView II can run an analysis on multiple groups of variables at
once. Results are presented either in paging or composite views. The
specifics of variable assignment for running multiple analyses are
discussed above.


Paging view windows can be paged through either by clicking in the 
scroll bar or using paging keys from the keyboard. Clicking on the 
up arrow flips back to the previous page. Clicking on the down 
arrow advances to the next page. As you page through a view, results 
are calculated for any new variables. 


If the results from the current statistic require more than one page in
the paging window, clicking in the gray area of the scroll bar below
the elevator advances to the first page of the next result. Clicking in
the gray area of the scroll bar above the elevator flips to the first
page of the previous result.


Dragging the elevator down advances to a subsequent result;
dragging the elevator up moves back to a previous result.
Command-Down arrow flips to the first page of the previous result, while
Command-Up arrow flips to the first page of the next result. If you
are using an extended keyboard, Home goes to the first page of the
first result and End goes to the first page of the last result.


In some analyses (like Mean, Std. Dev., etc.) the choice of paging or
composite views affects the graphic presentation of the analysis.
These details are provided in Chapters 4 and 5.


Preferences for Views


There are two global preference selections that affect viewing 
analysis results: the number of decimal places displayed, and size of 
the zoom for the view window. 


Decimal Places Displayed in Results 


StatView II stores and manipulates data to eighteen decimal places. 
We have seen that the number of decimal places displayed in a 
dataset is initially determined in the New Column Information dialog 
box (New from the File menu), and may be edited in the Format 
dialog box (Format from the Tools menu). 


The number of decimal places displayed in analysis results is also 
specifiable. 


*> Choose Preferences from the Tools menu, and the following 
dialog box appears: 


[Dialog box: Preferences. Number of Decimal Places for Results:
radio buttons 0 through 9 (3 selected). View Zoom Preference:
full screen or MacWrite size (MacWrite size selected). Enter Key
moves: right, down, or doesn't move (down selected).]





The row of radio buttons labelled Number of Decimal Places for
Results allows selection of zero through nine decimal places to be
displayed in analysis results. The number of places selected here
determines the number of decimal places displayed on the axes in
graphic views.


Zoom Window Size


If you are using a large-screen system, determine the size to which
the active windows are zoomed up using the radio button selection
full screen size or MacWrite (normal) size in the Preferences dialog
box (Preferences from the Tools menu).


Interactive Analysis


StatView II facilitates interactive analysis. The elements of
interactive analysis include the automatic recalculation of analysis
results when: data is edited; variables are reassigned; rows are
included or excluded; or range restrictions are defined or edited. We
have seen how the Quick Assignment dialog box can be used with a
view zoomed to full screen size to reassign variables and prompt
recalculation of results. The following section demonstrates how
including and excluding rows and setting range restrictions can be
used for interactive analysis as well.


Including and Excluding Rows 


Rows can be excluded from analyses either manually or based on
range restrictions set in specific columns. Manual inclusion and
exclusion is discussed here, and range restrictions are discussed in
the section immediately below.


Double-clicking on a row number at the left side of a dataset toggles
that row's included/excluded state. When the row is excluded, its row
number is gray; when the row is included, its row number is black.


When a row is manually excluded it is not included in the
computation of analyses (unless a Boolean OR range restriction
includes it again; see below).


Manual row exclusions of large numbers of contiguous rows can be 
performed by highlighting the rows (clicking and dragging over the 
row numbers), and choosing Exclude Row from the Variables 
menu. Large numbers of excluded contiguous rows can be included 
by highlighting them and choosing Include Row from the Variables 
menu. 


The Select All Rows choice in the Edit menu combined with the choice
Include Row from the Variables menu is a quick way to include all
rows in a dataset.


Select Range 


Select Range from the Tools menu lets you specify a range of values
in a selected column that includes or excludes rows in the entire
dataset. Range restrictions can be set for each column in a data set.
More than one range restriction can be set in a column. Range
restrictions can be set in continuous or category data columns.
Ranges are unset when data is edited.


The Select Range choice on the Tools menu is available whenever a
column is highlighted.


*> Highlight the columns Gender, Age and Weight in Lipid Data.


*> Choose Select Range from the Tools menu, and this appears:


[Dialog box: “Include rows based on the range of values in column:
Gender”, with a scrolling list of the category elements and the radio
button selection “Currently included data AND/OR the above
restriction” (AND selected).]





The title at the top of the dialog box identifies the column for 
which range restrictions are being set. Gender, being the leftmost 
of the highlighted columns, is first to appear in the Select Range 
dialog box. Because Gender is a category column, a scrolling list 
displays the elements of the category set that define the column. 


*> Highlight only male (this excludes all female cases).


At the bottom of the window is the radio button selection
Currently included data AND or OR the above restriction. If
AND is selected:


1. Everything outside the specified range is excluded from 
analysis. 


2. Existing range restrictions on data inside the specified range 
remain in effect. 


If OR is selected: 
1. Everything inside the specified range is included in analysis. 


2. Existing range restrictions outside the specified range remain 
in effect. 
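
The AND and OR rules can be read as ordinary Boolean operations on a per-row "included" flag. The Python sketch below is one way to picture them; it is not StatView II's implementation, the name `apply_restriction` is hypothetical, and the four rows are made-up data rather than values from Lipid Data.

```python
def apply_restriction(included, in_range, operator):
    """Combine the current inclusion flags with a new range restriction.

    included: one boolean per row (the currently included data)
    in_range: one boolean per row (does the row fall inside the new range?)
    operator: "AND" or "OR", as chosen in the Select Range dialog box
    """
    if operator == "AND":
        # Rows outside the range are excluded; earlier exclusions still hold.
        return [inc and hit for inc, hit in zip(included, in_range)]
    if operator == "OR":
        # Rows inside the range are included; other rows keep their state.
        return [inc or hit for inc, hit in zip(included, in_range)]
    raise ValueError("operator must be 'AND' or 'OR'")


# Hypothetical rows, for illustration only.
rows = [
    {"Gender": "male", "Age": 22, "Weight": 160},
    {"Gender": "male", "Age": 24, "Weight": 140},
    {"Gender": "female", "Age": 23, "Weight": 155},
    {"Gender": "male", "Age": 31, "Weight": 180},
]

included = [True] * len(rows)
included = apply_restriction(included, [r["Gender"] == "male" for r in rows], "AND")
included = apply_restriction(included, [20 <= r["Age"] <= 25 for r in rows], "AND")
included = apply_restriction(included, [r["Weight"] >= 150 for r in rows], "AND")
print(included)  # only the first row satisfies all three restrictions
```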


*> Select AND.


Since there are no previously existing range restrictions, all male
cases are included and all female cases are excluded.


*> Click OK and this appears:


[Dialog box: “Include rows based on the range of values in column:
Age”, with lower bound and upper bound text entry rectangles,
comparison radio buttons between them, a Cancel button, and the
AND/OR selection (AND selected).]


Age, the next highlighted column, is presented.


Range restrictions for continuous data are determined using the
lower and upper bounds text entry rectangles and the radio button
selections between them. If the dataset contains fewer than 1,000 rows,
the default values in the bounds rectangles are the lowest and the
highest values in the column. If the dataset contains more than 1,000
rows, two buttons labelled Min and Max appear under the bound
boxes, allowing you to elect to have the lowest and highest values
displayed. In this example, the lowest age in the dataset is 20 and the
highest is 40.


*> Press Tab and enter 25 in the upper bound rectangle. 


*> Leave AND selected, click OK, and this appears:


[Dialog box: “Include rows based on the range of values in column:
Weight”, with lower bound and upper bound text entry rectangles,
a Cancel button, and the AND/OR selection (AND selected).]


Now there are two restrictions set for the data. The group defined
so far for the analysis is males aged 20 through 25. The next
and last highlighted column, Weight, is displayed for range
restriction setting.


*> Enter 150 for the lower bound. 


*> Click Done.


The included rows are now the set of males, aged 20 to 25, who
weigh 150 lbs. or more. Graphing the data from these three
columns displays the range restrictions.


The Boolean operator OR is useful when setting two non-continuous 
ranges in the same variable. For example, if you wanted to view the 
cases in the data whose ages were 18 through 25 and 30 through 32, 
you could exclude all rows, and then use the Boolean OR to include 
these two ranges. To do this, set the first range, choose Select Range 
from the Tools menu again, and set the second range. 
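
As a rough illustration of the two-range case, the following Python sketch starts with every row excluded and ORs in each age range in turn. The list of ages is hypothetical, and the code only models the logic of the steps just described.

```python
ages = [17, 18, 22, 25, 27, 30, 31, 32, 40]   # hypothetical Age column

# Step 1: exclude all rows.
included = [False] * len(ages)

# Step 2: set each range with the Boolean OR, one Select Range pass per range.
for low, high in [(18, 25), (30, 32)]:
    in_range = [low <= age <= high for age in ages]
    included = [inc or hit for inc, hit in zip(included, in_range)]

selected = [age for age, inc in zip(ages, included) if inc]
print(selected)  # ages 18 through 25 and 30 through 32 remain
```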


While range restrictions can only be defined when the dataset is the
active window, they may be edited when the view window is active
and zoomed to full screen size. This interactive feature is discussed
immediately below.


Edit Range 


The Edit Range choice in the Tools menu is selectable only when 
ranges are in effect. This editing function, however, can be used 
while the view window is active. Therefore, range restrictions can be 
altered and the effects of the changes can be instantly analyzed and 
viewed. 


*> Make sure the range restrictions set in the above example are still
in effect (they will be if you have not edited a datapoint or
included or excluded a row). If you have unset the restrictions,
highlight the three columns and reset the restrictions. (Grayed out
row numbers indicate excluded cases.)


*> Assign Gender, Age, and Weight as X variables.


*> Choose Line Chart from the View menu and zoom view to full
screen size.


The composite line chart shows 45 cases meeting the range
specifications: gender (all ordinal value 1: male); age (20 through
25); and Weight (150 through 234).


*> Choose Edit Range from the Tools menu and this appears over
the view window:


[Dialog box: Range Restriction #1. Column: Gender. Boolean: AND.]


This box displays the first range restriction set in the data. The
name of the column in which the restrictions are set; the Boolean
operator; and the range of restrictions (in this case, the included
category elements) are displayed. Three buttons allow you to
either Edit, view the Next range restriction, or Exit the editing
mode.


The dialog box in which range restrictions are edited is exactly
the same as the one in which they were originated.


*> Click Edit.

*> Change the Gender to female, and click OK.

*> Click Next.

*> Click Edit.

*> Change the Age range to 21 through 30.

*> Click OK.


*> Click Exit, and the view window redraws showing three cases of
females from ages 21 through 30 who are over 150 lbs.


Assigning any other column as an X variable would cause that
variable to be displayed on the graph along with the gender, age, and
weight variables.


Clear Ranges 


The Clear Ranges choice in the Tools menu clears all ranges set in a
dataset. After the choice has been made, a dialog box appears asking
for confirmation of the command. If OK is clicked, all rows are
included in analyses until further range restrictions are defined.


Read Only Files


If you attempt to open a file that is already open, StatView II informs
you that the file is open and asks you if you wish to open a read-only
copy of the file. If you wish to open a read-only copy, the file is
opened with the words “— read only” appended to the file name.


A read-only file cannot be modified; you cannot change any of its
characteristics. All StatView II functions that alter files (editing,
cutting, deleting, transforming, etc.) are disabled for read-only files.


A read-only file is useful for comparing analyzed data to a
benchmark. To do this, open a dataset and a read-only copy of the
data. Select the same statistic for both datasets. Modify the
non-read-only data and compare the results with the original results
preserved in the read-only view window.


Saving Data


To save the active dataset, choose Save from the File menu. If the 
file is untitled, StatView II presents the standard Macintosh save-file 
dialog box. The box includes a radio button selection for saving the 
file in normal format (the format StatView II keeps its data in) or text 
format (for exchanging data with other programs). 


Normal Format 


The normal format stores and allows future reuse of data with the
full precision of the Macintosh (about 18 digits). The text file format
saves only the number of decimal places specified for display in each
column.


All column names, widths, and font styles are stored in normal
format. When a file saved in normal format is reopened, it looks
exactly as it did when it was saved, and analyses run on it take
advantage of full decimal precision whether the places are displayed
in the dataset or not.


StatView II reads normal format files much faster than it reads text
files.


Files should always be saved in normal format unless you wish to
transport data to another program or computer.


Text File Format


Selecting text file format allows you to save StatView II data as text
to help exchange data with another program. If text format is
selected, this dialog box appears:


[Dialog box: “Please specify how to save this text file.” Separate
items with: tabs (selected), commas, returns, or a character typed in
a text entry rectangle. Check boxes: Save column names; Enclose text
items with quotes; Save Category columns as small integers.]





StatView II lets you choose characters, called delimiters, to specify 
the end of each data point. You may select one of three default data 
point delimiters (tab, comma, or return) or specify one in the text 
entry rectangle. Unless you have a specific reason for selecting an 
alternative delimiter, use tabs as delimiters. (They are the Macintosh 
standard.) The check boxes allow saving of column names (as the 
first row in the text file), enclosing text in quotes, and saving 
Category set elements as their ordinal number representation. 
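
To make the effect of these options concrete, here is a small Python sketch that writes delimited text with the standard `csv` module. It only imitates the choices described above: the function `save_as_text` and its parameters are hypothetical, the line ending is an assumption, and the exact layout of StatView II's own text files is described in the section on importing text files.

```python
import csv
import io


def save_as_text(column_names, rows, delimiter="\t",
                 save_names=True, quote_text=False):
    """Return the rows as delimited text (a sketch, not StatView II's writer)."""
    buffer = io.StringIO()
    # QUOTE_NONNUMERIC puts quotes around text items but not around numbers.
    quoting = csv.QUOTE_NONNUMERIC if quote_text else csv.QUOTE_MINIMAL
    writer = csv.writer(buffer, delimiter=delimiter, quoting=quoting,
                        lineterminator="\n")  # line ending is an assumption
    if save_names:
        writer.writerow(column_names)  # column names become the first row
    writer.writerows(rows)
    return buffer.getvalue()


names = ["Gender", "Age"]
data = [["male", 23], ["female", 31]]
print(save_as_text(names, data))                                   # tab-delimited
print(save_as_text(names, data, delimiter=",", quote_text=True))   # commas, quoted text
```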


If text format is chosen, the file's data is saved only to the number of 
decimal places specified for each column. If you save a file as a text 
format file, make sure enough decimal places are displayed for each 
column using the Format choice on the Tools menu. 
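
The loss of precision is easy to demonstrate. In the Python sketch below, a value is written with three displayed decimal places and read back; the helper `as_text` is hypothetical and simply rounds to a fixed number of places, which is assumed to approximate what a text save does.

```python
def as_text(value, places):
    """Format a value with a fixed number of decimal places (illustrative)."""
    return f"{value:.{places}f}"


stored = 24.791666666666668      # full-precision value held in the dataset
saved = as_text(stored, 3)       # what a text file keeps at 3 displayed places
reloaded = float(saved)

print(saved)                     # 24.792
print(reloaded == stored)        # False: the extra digits are gone
print(as_text(stored, 9))        # more displayed places preserve more digits
```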


Large files saved as text files may not be able to be reopened. (Any 
file saved as normal format can be reopened.) If you cannot load a 
large text file, try loading on a Macintosh with more RAM. 


Files should always be saved in normal format unless you wish to 
transport data to another program or computer. 


See the section on importing text files for information on the 
structure of StatView II's text files. Be sure any program or computer 
that you are transferring StatView II text files to accepts this format.


Printing with StatView II


The Print command lets you print a view window. When you choose
Print from the File menu, you are presented with a modified
Print dialog box. This new dialog consists of the standard Print
dialog on top and several new options available for printing view
windows on the bottom. This appears if you are using an
ImageWriter:


[Dialog box: the standard ImageWriter Print dialog (Quality: Best,
Faster, or Draft; Page Range: All or From/To; Copies; Paper Feed:
Automatic or Hand Feed), followed by the StatView II additions:
Page Title and Page Footer text entry rectangles and the check boxes
“Don't print Page Frames” and “Use whole sheet of paper for View”.]





The Draft option is grayed out for graphic views because graphics 
cannot be printed in draft mode. 


Several customizing features are available. They are: 


Titles and footers — Text entered in the input rectangles 
labelled Page Title and Page Footer labels every page of 
your printout with a common title and footer. If no text is 
entered in a title or footer rectangle, then no title or footer 
appears. 


Removing frames — Normally, when StatView II prints a 
view window, it prints a frame around it making it resemble 
the window on the screen. You can stop the frame from being 
printed by checking Do not print page frame. 


Full page views — It is often useful to have a graph printed 
as large as possible. Checking the Use whole sheet of paper 
for View box causes StatView II to use as much of the paper 
as possible to print out the graphic view. This selection can 
cause a graph to be distorted. For example, if your 
ImageWriter has 8.5" x 11" paper in it and the printing 
orientation is normal, when you print using the full sheet, the 
graph is stretched in the vertical dimension. This selection is
most appropriate when printing configuration is set sideways 
(landscape). This selection is available for graphic views 
only. 


Print scaling — This option allows your graph to use the 


whole sheet of paper. Since the graph you are printing is
typically smaller than a sheet of paper, StatView II must
expand the graph's size and scale items within the graph
accordingly. User-drawn shapes, although they are moved
around on the graph, are not re-sized. If you drew a square
around the plotting area of a scattergram and printed that
scattergram using the whole page, your square is printed over
the upper-left corner of the plotting area, no longer enclosing
the graph. Because of this, full sheet printing is only useful
when printing graphs without user-added text or shapes.
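
The geometry behind both effects (the vertical stretch and the user-drawn square that no longer encloses the plot) can be sketched numerically. Everything in this Python example is an assumption made for illustration: the function names, the (x, y, width, height) rectangle convention, and the sizes are hypothetical and are not taken from StatView II.

```python
def scale_factors(view_size, sheet_size):
    """Horizontal and vertical factors needed to fill the sheet."""
    return sheet_size[0] / view_size[0], sheet_size[1] / view_size[1]


def scaled_rect(rect, sx, sy):
    """A graph element: both its position and its size are scaled."""
    x, y, w, h = rect
    return (x * sx, y * sy, w * sx, h * sy)


def moved_rect(rect, sx, sy):
    """A user-drawn shape: its position moves, but its size is kept."""
    x, y, w, h = rect
    return (x * sx, y * sy, w, h)


view = (400, 300)    # hypothetical size of the graph in the view window
sheet = (800, 900)   # hypothetical printable area of a portrait sheet
sx, sy = scale_factors(view, sheet)

plot_area = (50, 40, 300, 200)    # plotting area in the view
user_square = (50, 40, 300, 200)  # shape drawn exactly around it

plot_on_paper = scaled_rect(plot_area, sx, sy)
square_on_paper = moved_rect(user_square, sx, sy)
print(sx, sy)            # unequal factors mean the graph is stretched vertically
print(plot_on_paper)     # the plotting area grows to fill the sheet
print(square_on_paper)   # same corner, original size: covers only the upper left
```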


Any range restrictions that are in effect are printed in views. 


Note: You can produce color output by printing with an ImageWriter
II or ImageWriter LQ using a color ribbon. Since a Macintosh II can
generate more colors than these printers, the colors you print will
often be an approximation of the colors on your screen. Any other
color printers, color plotters or slide making hardware which are
compatible with the Macintosh II will also produce color output of
your graphs.


No additional options are available for printing a data window. The 
standard print dialog for your printer appears when Print is chosen 
from the File menu. 
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Introduction 


Chapter 3 — Drawing with StatView II 


So far, you have only briefly seen StatView II's powerful drawing
capabilities. This chapter shows you how you can use StatView II to
graph your statistics and data and to modify your graphs to make
them clearer.


As you read this chapter, you should note the difference between the 
two types of drawing: 


* Creating a chart from your statistics and data is a mechanical
routine that StatView II does for you. When you select a
chart, StatView II automatically draws the axes, tick marks,
data points, legend, and so on. You will find that most of the
charts that StatView II creates are so complete that you do
not need to modify them at all.


* Modifying a chart is easy. You can change features of the
chart (such as the colors used, the point types, the range of
the axes, and so on) and can add information to the chart
(such as pictures, additional text, arrows to highlight
important data, and so on). The tools you use to add
information to the chart are almost identical to those used in
popular programs such as MacDraw.


The tools palette on the left side of the screen has two columns. 





The left column contains drawing tools for selecting and adding
information to your chart. The right column contains view controls
for changing graph attributes and modifying what you see in your
chart, such as the axes and points.


Note that this “palette” is available at all times when a graph is
shown. As you read this chapter, remember that the drawing tools
and view controls appear in the view window, not in the Mac's
pull-down menus.


Note: The tools palette is not copied or printed with the graph.


Graphs You Can Make


StatView II gives you a very wide variety of graphs. Since graphs 
are related to the types of data you are using, there are often many 
relevant graphs for a particular set of data. This section lists all the 
types of graphs you can produce with StatView II. 


The following sections tell you what you need to do in order to 
create graphs. The elements you must specify are: 


* Variables — The number of columns selected, and whether 
they are specified as X or Y variables. 


* Statistics — Choices from the Describe or Compare menu, 
such as Percentiles or Regression. 


* View — Specify the type of chart you want to see from the 
View menu, such as Scattergram or Box Plot. 


* Composite/paging tool — 


This is an important tool for most views. It is described later
in the chapter.


Univariate Chart 


Univariate plots present one-dimensional data. They have only an 
ordinate (Y axis), as there are no values for the abscissa. Each 
individual observation is plotted. 


To create a univariate chart, you need to have assigned at least one X 
column and no Y columns. If you have more than one X column 
assigned, you may overlay all the X variables on one composite 
chart or view each individual X column on a paging chart. Select 
None in the Describe and Compare menus. In the View menu, you 
may choose either Scattergram, Bar Chart, or Line Chart. 
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A univariate scattergram comparing two X variables looks like: 


Scattergram for columns: X; ~. X2 


240 
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If you click the composite/paging tool, each variable will appear in a 
separate page. The first page for the previous chart looks like: 


Scattergram for column: X; Weight (Ibs) 


© Weight (ibs) 


Weight (Ibs) 


db 
oO 
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The second page looks like: 


[Figure: Scattergram for column: X2 weight-3yr]
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A univariate line chart looks like: 
[Figure: Line Chart for column: X1 Weight (lbs)]
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A univariate bar chart looks like: 


[Figure: Bar Chart for column: X1 Weight (lbs); X axis: Observations]


There is no composite view available for the bar chart. 


If you choose Mean, Std. Dev., etc. from the Describe menu, the 
standard deviation error band and mean are noted by lines and 
marked on the right ordinate. A univariate scattergram with mean 
and standard deviation lines looks like: 


[Figure: Scattergram for column: X1 Weight (lbs); X axis: Observations]
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If you set the composite/paging tool to composite you will produce a 
plot of means with error bars around them (see the section on Error 
Bars below for further information). 


Percentile Plot (Cumulative Frequency Curve) 


A percentile plot plots observed values against their percentiles. It 
allows you to quickly estimate the percentile associated with any 
observed value in a distribution. There are several different views 
available. 


To create a percentile graph, assign at least one X column and no Y 
columns. If you assign more than one X column, you may overlay all 
the X variables on one composite chart or view each individual X 
column on a paging chart. Select Percentiles in the Describe menu 
and select None in the Compare menu. In the View menu, choose 
Scattergram, Bar Chart, or Line Chart. 


A composite percentile graph comparing two X variables looks like: 


[Figure: Percentiles Plot for columns: X1 ... X2]
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If you click the composite/paging tool, each variable will appear in a 
separate page. The first page for the previous chart looks like: 


[Figure: Percentiles Plot for column: X1 cholesterol (mg/dl)]
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The second page looks like: 


[Figure: Percentiles Plot for column: X2 female; axis: cholesterol (mg/dl)]
Clicking the percentile control forces horizontal lines to be
displayed on the plot representing the five percentile values (10th,
25th, 50th, 75th, and 90th). This control, which is unique to
percentile plots, is only available on the paging view.


[Figure: Percentiles Plot for column: X1 cholesterol (mg/dl), with percentile lines]
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You can view this chart as a line chart or as a bar chart: 
[Figure: Percentiles Plot for column: X1 cholesterol (mg/dl)]
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There is no composite view available for bar charts. 


Scattergram 


A scattergram plots the relationship between two variables, X and Y. 
The types of scattergrams are: 


* regular scattergram 
* scattergram with fitted simple regression 
* scattergram with fitted polynomial regression 


To create a scattergram, assign one or more columns as X variables
and one or more columns as Y variables. 


If more than one X and more than one Y variable are assigned, Y1 is
plotted against X1, Y2 against X2, and so on. If three X variables
and four Y variables are assigned, the Y4 variable (with no matching
X variable) is not graphed. If there is a single X variable and more
than one Y variable, each Y variable is plotted against the X
variable. The same situation occurs if there is a single Y variable and
more than one X variable.
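

The pairing rule described above can be sketched in a few lines of
Python. This is only an illustration of the rule as stated in this
manual, not StatView II's own code; the function name pair_columns is
hypothetical, and dropping an unmatched X column in the same way as an
unmatched Y column is an assumption.

```python
def pair_columns(x_cols, y_cols):
    """Return the (X, Y) column pairs that would be graphed.

    Illustrative sketch only; the function name is hypothetical.
    """
    # A single X variable is paired with every Y variable.
    if len(x_cols) == 1:
        return [(x_cols[0], y) for y in y_cols]
    # A single Y variable is paired with every X variable.
    if len(y_cols) == 1:
        return [(x, y_cols[0]) for x in x_cols]
    # Otherwise pair by position. zip() stops at the shorter list, so a
    # column with no matching partner (such as Y4 below) is not graphed.
    return list(zip(x_cols, y_cols))


pairs = pair_columns(["X1", "X2", "X3"], ["Y1", "Y2", "Y3", "Y4"])
print(pairs)
```

With three X columns and four Y columns, the result contains three
pairs and Y4 does not appear.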


If you have assigned more than one X-Y pair, you may overlay all
the variable pairs on one composite chart or view each individual X- 
Y pair on a paging chart. 


Select None in the Describe menu. In the Compare menu choose 
None or choose Regression if you wish a regression line fitted to 
your data (in the Regression dialog, choose Simple or Polynomial). 
In the View menu, choose Scattergram. 


A regular scattergram with two Y variables plotted against one X 
variable looks like: 
[Figure: Scattergram for columns: X1Y1 ... X1Y2]
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A simple regression fitted to this data looks like: 


[Figure: Scattergram for columns: X1Y1 ... X1Y2; axis: heat evolved (calories/gm cement)]


The equations for these lines are located in the table views and will 
be discussed in Chapter 5. 


The confidence bands control is unique to a scattergram with fitted
simple regression. This control allows you to display confidence
bands for the mean and slope of the fitted regression.
[Figure: Scattergram for columns: X1Y1 ... X1Y2]
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A polynomial regression fitted to this data looks like: 


[Figure: Scattergram for columns: X1Y1 ... X1Y2]
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If you click the composite/paging tool, each variable pair will appear 
on a separate page. The first page shows: 


[Figure: Scattergram with fitted polynomial regression, y = -170.91 - 227x + .134x^2 - .002x^3 + 7.737E-6x^4; axis labels: tricalcium aluminate, heat evolved (calories/gm cement)]


The second page shows: 
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Comparison Percentile Chart 


A comparison percentile chart compares 19 corresponding 
percentiles of two variables. The percentiles compared are: 1, 2, 3, 4,
5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 96, 97, 98, 99. This chart is 
extremely effective for comparing two data distributions. 
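

To make the idea concrete, the sketch below computes the 19 percentile
pairs for two variables. It is an illustration only: the names
PERCENTILES, percentile, and comparison_points are hypothetical, and
the linear-interpolation percentile definition is an assumption, since
this manual does not state which definition StatView II uses.

```python
# The 19 percentiles listed above.
PERCENTILES = [1, 2, 3, 4, 5, 10, 20, 30, 40, 50,
               60, 70, 80, 90, 95, 96, 97, 98, 99]


def percentile(values, p):
    """Percentile p (0 to 100) by linear interpolation.

    Assumption: interpolation between neighboring sorted values.
    StatView II may define percentiles differently.
    """
    data = sorted(values)
    if not data:
        raise ValueError("no data")
    rank = (p / 100) * (len(data) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(data) - 1)
    fraction = rank - lower
    return data[lower] + fraction * (data[upper] - data[lower])


def comparison_points(x_values, y_values):
    """One (X percentile, Y percentile) point per listed percentile."""
    return [(percentile(x_values, p), percentile(y_values, p))
            for p in PERCENTILES]
```

If the two variables have the same distribution, every point falls on
the line y = x. Points that drift away from that line show where the
two distributions differ.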


To create a comparison percentile chart, select one or more columns 
as X variables and one or more columns as Y variables. 


If more than one X and more than one Y variable are assigned, Y1 is
plotted against X1, Y2 against X2, and so on. If three X variables
and four Y variables are assigned, the Y4 variable (with no matching
X variable) is not graphed. If there is a single X variable and more
than one Y variable, each Y variable is plotted against the X
variable. The same situation occurs if there is a single Y variable and
more than one X variable.


If you have assigned more than one X-Y pair, you may overlay all 
the variable pairs on one composite chart or view each individual X- 
Y pair on a paging chart. 


In the Describe menu you should have None selected. In the 
Compare menu choose Compare Percentiles. In the View menu, 
choose Scattergram or Line Chart. 


A comparison percentile scattergram with one X variable compared 
to one Y variable looks like: 
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Percentile Comparison of Patient Weight by Gender 
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The equal axes control is unique to a comparison percentile chart. 


This control makes the comparison percentile chart a square chart
displaying the line y = x.


A comparison percentile line chart with one X variable compared to 
one Y variable looks like: 


Percentile Comparison Line Chart Of HDL Levels 
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Line Chart 


A line chart plots the relationship between two variables, X and Y. It 
is one of the best displays of a set of measurements of a variable 
through time. The line chart can be drawn with plotting symbols to 
clearly distinguish individual data points, or without plotting
symbols.


To create a line chart, assign one or more columns as X variables and
one or more columns as Y variables. 
If more than one X and more than one Y variable are assigned, Y1 is
plotted against X1, Y2 against X2, and so on. If three X variables
and four Y variables are assigned, the Y4 variable (with no matching
X variable) is not graphed. If there is a single X variable and more
than one Y variable, each Y variable is plotted against the X
variable. The same situation occurs if there is a single Y variable and
more than one X variable.


If you have assigned more than one X-Y pair, you may overlay all 
the variable pairs on one composite chart or view each individual X- 
Y pair on a paging chart. 


In the Describe and Compare menus, you should have None 
selected. In the View menu, select Line Chart. 


A regular line chart with two Y variables plotted against one X 
variable looks like:
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If you click the composite/paging tool, each variable pair will appear 
on a separate page. The first page looks like: 


[Figure: Unemployment 1946 - 1964; X axis: Year]
The second page looks like: 


84 Chapter 3 — Drawing with StatView II 


[Figure: Armed Forces 1946 - 1964]
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The no symbols control is unique to a line chart. 


This control allows you to specify whether or not you wish the line 
chart to contain plotting symbols. 


A line chart with one Y variable plotted against one X variable and 
no plotting symbols looks like: 
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Bar Chart 


A bar chart plots the relationship between two variables, X and Y. 
Like the line chart, it best displays a set of measurements of a 
variable through time. 


To create a bar chart, select one or more columns as X variables and 
one or more columns as Y variables. 


If more than one X and more than one Y variable are assigned, Y1 is
plotted against X1, Y2 against X2, and so on. If three X variables
and four Y variables are assigned, the Y4 variable (with no matching
X variable) is not graphed. If there is a single X variable and more
than one Y variable, each Y variable is plotted against the X
variable. The same situation occurs if there is a single Y variable and
more than one X variable.


The bar chart can only display one individual X-Y pair per page; 
there is no composite view. 


In the Describe and Compare menus, you should have None
selected. In the View menu, choose Bar Chart.


A bar chart with one Y variable plotted against one X variable looks
like:
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Histogram 


A histogram displays the frequency distribution of a variable. 


To create a histogram, assign one or more columns as X variables. In
the Describe menu, choose Frequency Distribution. In the View
menu, choose Bar Chart. The histogram can only display one
individual X variable per page; there is no composite view.


A histogram of continuous data looks like: 
[Figure: Histogram of X1: Weight (lbs)]
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A histogram of category data looks like: 


[Figure: Histogram of X1: Alcohol use]
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z-Score Histogram 


A z-score histogram displays the z-score frequency distribution of a 
variable. 
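

As a rough sketch of what a z-score frequency distribution involves,
the Python below standardizes each value and counts the results in
bins. The function names are hypothetical, and two details are
assumptions rather than facts about StatView II: the use of the sample
standard deviation (n - 1 denominator) and bins one standard deviation
wide.

```python
import math
from collections import Counter
from statistics import mean, stdev


def z_scores(values):
    """Standardize values: z = (value - mean) / standard deviation.

    Assumption: the sample standard deviation (n - 1) is used.
    """
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def z_score_frequencies(values):
    """Count z-scores per bin; bin k covers k <= z < k + 1.

    Assumption: bins one standard deviation wide, chosen for
    illustration only.
    """
    return Counter(math.floor(z) for z in z_scores(values))


weights = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(z_score_frequencies(weights))
```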


To create a z-score histogram, select one or more columns as X 
variables. In the Describe menu, select Mean, Std. Dev., etc., or 
Confidence Intervals. In the View menu, choose Bar Chart. The 
composite/paging tool should be set for paging. 


A z-score chart looks like: 


[Figure: Z Score of X1: Weight (lbs)]
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If you set the composite/paging tool to composite, you will produce 
a comparative bar chart, described below. 
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Comparative Bar Chart 


A comparative bar chart compares the mean values from similar 
variables. The X axis identifies the columns while the Y axis 
displays numerical values. 


To create a comparative bar chart, select one or more columns as X 
variables. In the Describe menu, select Mean, Std. Dev., etc. or 
Confidence Intervals. In the View menu, choose Bar Chart. The 
composite/paging tool should be set for composite. 


A comparative bar chart looks like:


[Figure: Bar Chart of Column Means: X1 ... X4]
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If you set the composite/paging tool to paging you will produce a z- 
score histogram of each X variable (described above). 


Pie Chart 

Pie charts allow you to visually compare categories to each other. 
They are ideal for displaying information about the category 
variables of a StatView II dataset. 

To create a pie chart, select one or more columns as X variables. In 
the Describe menu, select Frequency Distribution. In the View 
menu, choose Pie Chart. The pie chart can only display one 
individual X variable per page; there is no composite view. 


A pie chart of category data looks like: 
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Pie Chart of X1: Alcohol use 
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Error Bars 


Error bars compare the distribution of variables by displaying 
variation around sample means. StatView II creates one-, two-, or 
three-tiered error bars portraying confidence intervals or the standard 
deviation of a variable. 


To create error bars, select one or more columns as X variables. In 
the Describe menu, select Mean, Std. Dev., etc. or Confidence
Intervals. In the View menu, choose Scattergram or Line
Chart. The composite/paging tool should be set for composite. 


Error bars look like: 
The bar control is unique to error bars. It allows you to add a bar
to the bottom of your error bars. Error bars with bars added look
like:
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Connected error bars in a line chart look like: 
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If you set the composite/paging tool to paging you will produce a 
univariate scattergram for each X variable (described above). 


Box Plots 


A box plot is a graphic method for displaying the 10th, 25th, 50th, 
75th, and 90th percentiles of a variable. It is often used for 
comparing variable distributions. 
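

The five percentiles a box plot displays can be computed as in the
sketch below. This is an illustration, not StatView II's own method:
the function name box_plot_summary is hypothetical, and the
interpolation method used by Python's statistics.quantiles is an
assumption that may give slightly different values than StatView II.
Outliers follow the definition used in this chapter (values below the
10th or above the 90th percentile).

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the five box plot percentiles and the outlying values."""
    # quantiles(..., n=20) returns 19 cut points at 5%, 10%, ..., 95%,
    # so the cut point for percentile p is at index p // 5 - 1.
    cuts = quantiles(values, n=20, method="inclusive")
    summary = {
        "p10": cuts[1],
        "p25": cuts[4],
        "p50": cuts[9],
        "p75": cuts[14],
        "p90": cuts[17],
    }
    # Values outside the 10th to 90th percentile range are outliers.
    outliers = [v for v in values
                if v < summary["p10"] or v > summary["p90"]]
    return summary, outliers


summary, outliers = box_plot_summary(list(range(101)))
print(summary, len(outliers))
```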


To create a box plot, select one or more columns as X variables. If 
you have more than one X column assigned, you may overlay all the 
box plots on one composite chart or view each individual box plot on 
a paging chart. In the Describe menu, select Percentiles. In the View 
menu, choose Box Plot. 


A box plot comparing two variables looks like: 
[Figure: Box Plots for columns: X1 ... X2; Y axis: Cholesterol (mg/dl); X axis: Columns (male, female)]


If you click the composite/paging tool, each variable will appear on a 
separate page. The first page looks like: 


[Figure: Box Plots for column X1: male]
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The second page looks like: 
Two controls appear on this view which are unique to box plots. The
notch control changes the box plot to a notched box plot, where the
notches represent 95% confidence bands about the median.


[Figure: Box Plots for columns: X1 ... X2]


The outlier control eliminates the representation of the extreme
twenty percent of the observed values, ten percent below the 10th
percentile and ten percent above the 90th percentile.


Composite and Paging Views


The composite/paging control is in the upper right corner of the tool
palette. When it is in composite mode, the icon looks like:


When it is in paging mode, it looks like:


In composite mode, the graph portrays more than one variable or X-Y
combination. This lets you see the interrelationships of many
variables at the same time.


In paging mode, the graph portrays only one variable or X-Y 
combination per page. The scroll bar on the right side of the screen 
becomes active; clicking in the scroll bar advances to the next 
variable or combination. Paging mode is useful for inspecting 
individual variables. 


Note that toggling from paging to composite mode can change the
type of graph you see. If you have a univariate scattergram with
Mean, Std. Dev., etc. selected in paging mode, toggling to
composite mode will change the graph to error bars. Also note that
two view controls (subset specify and error bars) are only available
in composite mode. Even if you have only one X-Y pair, you can put
your graph in composite mode to access these view controls.


View Controls


The view controls in the tools palette let you change the way that
StatView II displays your data. Some of these controls are available
for all charts, while others only apply to certain chart types.


Many of the view controls are toggles, meaning that they switch
between two settings. For instance, the composite/paging control you
saw above is a toggle since it specifies which of the two possible
modes you are in. If you are in paging mode, clicking the
composite/paging control toggles you to composite mode. If you are
in composite mode, clicking the composite/paging control toggles
you to paging mode.


The other controls are parameter controls which let you enter values 
for how the control should act. For instance, the point overlap 
control lets you choose how to show overlapping values in a graph. 
Remember, the tools palette is not copied or printed with the graph. 
To see how these controls work: 


*> Activate Lipid Data. 


*> Choose Quick Assignment from the Vars menu, clear previous
assignments, and make Weight an X variable and Cholesterol a
Y variable.


*> Choose Scattergram from the View menu, and Zoom Up the 
view. 


[Figure: Scattergram for columns: X1Y1]
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Frame 


The frame control specifies whether the chart is framed on all four
sides or just the left and bottom. Charts are initially framed. When
you are in framed mode, the icon is:


Clicking on the control switches you to unframed mode. When you
are in unframed mode, the icon is:


Point Overlap 


The point overlap control looks like this: 


Graphs often contain overlapping points. This occurs when the 
locations of different datapoints are either identical or very close to 
each other. 

For example, modify Lipid Data so that values overlap. 

*> Select the first record. 

*> Choose the Copy command in the Edit menu. 

*> Select the gray additional record at the bottom of the data.


*> Choose the Paste command in the Edit menu. 


This creates two data points at the same location. To experiment with 
handling overlapping points: 


*> Click on the point overlap tool and this dialog box appears: 


Select method of handling point overlap


Determine overlap with:        Show overlap with:
(*) Don't handle overlap       (*) Sunflowers
( ) Exact Coincidence          ( ) Bigger Points
( ) Cellulation


Resolution: ( ) Coarse  (*) Medium  ( ) Fine
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This box offers three choices for displaying overlapping points 
and two styles for representing those points. Under the heading 
Determine Overlap With are the choices Do Not Handle 
Overlap; Exact Coincidence; and Cellulation. 


The selection Do Not Handle Overlap, the default, means that 
StatView II does not do anything to indicate hidden points 
caused by overlap. Points which exactly coincide appear as a 
single point and points which partially coincide appear as 
overlapping points. 


The two selections below this one are different ways for 
displaying overlapping points. If you choose either of these 
methods, you must specify how to display the overlap. There are 
two methods for displaying the points: 


Sunflowers Points with petals (lines) emanating from them. 
Each petal represents one point at that location. A 
point with no petal represents a single datapoint. 


Bigger Points Geometrically enlarged points with enlargement 
size based on the number of points that coincide 
at that location. The size of points representing 
single datapoints is determined by the point size 
control. The bigger the starting point, the larger 
are the enlarged points. 


When Exact Coincidence is selected, points that exactly overlap 
are displayed according to the selected way to show overlap. 
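

The Exact Coincidence method amounts to counting how many data points
share each location. The sketch below illustrates this under stated
assumptions: the function names are hypothetical, the data values are
made up for the example, and the petal rule simply restates the
Sunflowers description above (a single point has no petals; otherwise
one petal per point).

```python
from collections import Counter


def coincident_counts(points):
    """Map each distinct (x, y) location to the number of points there."""
    return Counter(points)


def sunflower_petals(count):
    """Petals drawn for a location holding `count` coincident points."""
    # A single data point is drawn with no petals; otherwise each
    # petal represents one point at that location.
    return 0 if count == 1 else count


# Hypothetical (weight, cholesterol) records; the first and last
# records are identical, as in the example that follows.
records = [(150, 200), (160, 210), (150, 200)]
for location, count in coincident_counts(records).items():
    print(location, "petals:", sunflower_petals(count))
```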


*> Choose Exact Coincidence. 


*> Choose Sunflowers, click OK, and this view appears: 


[Figure: Scattergram for columns: X1Y1]
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There are two data points that overlap, since the first and last 
records are identical. They are represented by a sunflower with 
two petals. To see what happens if three points overlap, add the 
record again. 


*> Activate the data window.


*> Copy the last record and paste it into the gray additional record
at the bottom of the window.


*> Activate the view window; notice that the sunflower now has
three petals.


*> Click on the point overlap tool.


*> Choose representation by Bigger Points, and click OK.


[Figure: Scattergram for columns: X1Y1]
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The three points are now represented by a larger data point. 


The third choice under the heading Determine Overlap With is
Cellulation. This method is best for handling large datasets.
When this choice is selected, StatView II:


* Divides the scattergram into invisible grid regions.


* Counts the points that occur in each region.


* Represents the population of each region as either a
sunflower point (with each petal representing one point in the
grid square) or as a geometrically enlarged point. The point is
centered in the middle of the square whose population it
represents.
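

The three steps above can be sketched as follows. This is a minimal
illustration, not StatView II's implementation: the function name
cellulate and its parameters are hypothetical, the grid is assumed to
have the same number of cells on each axis, and all points are assumed
to lie within the given axis ranges.

```python
from collections import Counter


def cellulate(points, x_range, y_range, cells_per_axis):
    """Count points per grid cell and return {cell center: count}.

    Assumes every point lies inside x_range and y_range.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    cell_width = (x_max - x_min) / cells_per_axis
    cell_height = (y_max - y_min) / cells_per_axis

    counts = Counter()
    for x, y in points:
        # Find the cell holding the point. min() keeps a point lying
        # exactly on the upper edge inside the last cell.
        col = min(int((x - x_min) / cell_width), cells_per_axis - 1)
        row = min(int((y - y_min) / cell_height), cells_per_axis - 1)
        counts[(col, row)] += 1

    # Each populated cell is represented by one symbol at its center.
    return {
        (x_min + (col + 0.5) * cell_width,
         y_min + (row + 0.5) * cell_height): n
        for (col, row), n in counts.items()
    }


centers = cellulate([(1, 1), (2, 2), (9, 9)], (0, 10), (0, 10), 2)
print(centers)
```

A larger cells_per_axis value gives smaller, more numerous regions; a
smaller value gives fewer, larger regions.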


*> Click on the point overlap tool and choose Cellulation. Leave
Bigger Points selected.


The bottom radio button array titled Resolution becomes 
selectable. It contains the choices: Coarse, Medium, and Fine. 
These selections determine the size of the region used in 
cellulation. Fine divides the scattergram into small, numerous 
regions, Medium divides the scattergram into fewer medium 
sized regions, and Coarse divides the scattergram into even 
fewer large regions. 
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*> Choose Medium for resolution, click OK, and this view appears: 
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The view shows various blocks of circles. You can see how the 
graph is divided into a grid. The size of the circle shows the 
number of points within each grid square. 


*> Click on the overlap tool, select Coarse, and use Sunflowers to
show overlap. Click OK. This view appears:


[Figure: scattergram with sunflowers; axis: Cholesterol]
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The graph is now divided into larger grids. The number of petals 
in the sunflower is the number of points counted in the region. 


The parameters which you set using the Overlap control remain 
in effect for all successive scattergrams. Note that if you are 
handling overlap, the subset specify and error bar controls are no 
longer active. 
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Subset Specify 


The subset specify control looks like this: 




This control allows you to differentiate subsets of values in a 
scattergram or line chart by a different plotting symbol or color. 
These subsets are identified by the category variables of the dataset. 
The subset specify tool is only available when you are in composite 
mode. 


*> Click on the subset specify control, and this dialog box appears: 


Select Category:

Gender
Smoking History
Alcohol use
Heart Attack
Heart History

[OK] [Remove] [Cancel]


The scrolling list contains the names of all the category variables
in your dataset. Choose the category containing the subsets you
wish to identify in your graph.


*> Choose Gender.


*> Click OK.


Notice that different points (or colors) now distinguish the male
and the female data. The legend has also been updated to indicate
the subsets in your data.


To remove a subset, click the Remove button in the dialog box.
Note that selecting a new statistic, choosing a different view, or
toggling from composite to paging will remove any subset
specifications.


Error Bars 


The error bars control looks like this: 


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This control allows you to place customized error bars on Y variable 
observations in scattergrams and X variable observations in 
univariate scattergrams. It is available only for composite views. 


*> Click on the error bars tool, and this dialog box appears: 


Error Bars for: Systolic BP


( ) No Error bars for this column
( ) Use a fixed error of [    ]
( ) Use [    ] % of the data value
( ) The selected column contains the error


Age
Weight
Cholesterol
Chol - 1yr
Chol - 2yr
Chol - 3yr


[Next] [Cancel] [Remove All]


You may specify error bar values for each Y variable on the
graph. The variable currently referenced is noted in the title.


You may specify:


* No Error Bars for a column


* Fixed Error Value for each data point (value is entered in
the text box)


* % of the Data Value as the error value (percentage is
entered in the text box and applied to each observation)


* Selected Column Contains the Error Values (select a
column from the scrolling list to contain the error values for
the data column. The error value is taken from the same row
as the data value with which it is associated.)
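

The four choices above can be summarized in a short sketch. The
function names, the mode strings, and the parameters are all
hypothetical, and drawing each bar symmetrically above and below the
data value is an assumption; the manual does not spell out how
StatView II positions the bar ends.

```python
def error_values(data, mode, fixed=None, percent=None, error_column=None):
    """Return one error value per data value (None means no bar)."""
    if mode == "none":
        return [None for _ in data]
    if mode == "fixed":
        # The same error value is used for every data point.
        return [fixed for _ in data]
    if mode == "percent":
        # The percentage is applied to each observation.
        return [abs(v) * percent / 100 for v in data]
    if mode == "column":
        # The error is taken from the same row as the data value.
        return [error_column[i] for i in range(len(data))]
    raise ValueError("unknown mode: " + mode)


def error_bar_limits(data, errors):
    """Lower and upper bar ends, assuming symmetric bars."""
    return [None if e is None else (v - e, v + e)
            for v, e in zip(data, errors)]


systolic = [100, 200]
errors = error_values(systolic, "percent", percent=10)
print(error_bar_limits(systolic, errors))
```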


*> Click Next to move to the next Y variable. If there are no more
columns, this button changes to Done.


*> Click Remove All to remove all error bars from the graph.


*> Click Cancel to abandon the operation.

If you have chosen a data column as containing the error values,
changing values in this column causes the graph to be redrawn. If
this column is cut or deleted, no error bars are drawn for the
associated data column.
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Error bars for means and confidence bands for X variables are 
discussed earlier in this chapter in the discussion of Error Bar graphs. 


Confidence Bands 


If you are graphing a simple regression, the confidence bands control 
is added to the view controls menu. It looks like: 




Use this control to add confidence bands to your chart. 


*> Click on the confidence bands tool; the following dialog box 
appears: 


Select confidence information for a Simple Regression


[ ] 95% confidence limits for slope of regression line
[ ] 95% confidence bands for the true mean of Y


[ ] 90% confidence limits for slope of regression line
[ ] 90% confidence bands for the true mean of Y





For both confidence levels (95% and 90%) you can plot:


* the confidence limits for the slope of the regression line.


* the confidence bands for the true mean of Y.


Outlier


If you have a box plot, you can specify whether or not to show
outliers. Outliers are those values below the 10th percentile and
above the 90th percentile. The default is to show outliers. The no
outlier control is:


When you click this, it becomes the add outliers control:
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Notch 


If you have a box plot, the notch control switches from a regular box
plot to a box plot with notches in its sides. The default is no notch.


The notch control looks like: 


When you click it, it becomes the no notch control: 


Percentile


The percentile control lets you add to or remove from your percentile
graph lines representing the 10th, 25th, 50th, 75th, and 90th
percentiles. The default is not to draw the lines. The percentile
control is:


Clicking this changes the tool to the no percentile line control:


Bar 


If you have an error bar chart, this control lets you add a bar to the
bottom of the error bar. The bar control looks like:


The no bar control looks like:


Equal Axes


The equal axes control lets you change the dimensions of the axes
when displaying a scattergram or line chart of a comparison
percentile. The default is to have the axes be equal. The unequal axes
control is:


Clicking this lets you change the axes. To make them equal again,
use the equal axes control:


No Symbols


The no symbols control lets you remove plotting symbols from line
charts. Each line will then be differentiated by different line pens.


The no symbols control is:


Clicking on it changes it to the add symbols control:


Layout of StatView II Graphs


Before you start using StatView II's drawing tools, it is useful to see
how to modify graphs. The view window shows your graph with any
changes you have made in it.


Layers 


The window has four layers that are manipulated separately:


[Figure: the four layers, from front to back: legend, drawings, statistics, background]


Anything on a higher layer will have precedence of display. For 
example, if you add a solid black box to the drawings layer, it will 
cover up whatever is behind it on the statistics layer. The legend is 
always seen on top. 


The lowest layer, the background, is a single plane of color which 
you can control. 


The next higher layer is the statistics layer. This is the chart that
StatView II makes. Objects which appear in this layer are the axes
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and chart contents area. For example, the contents area for a Box 
Plot would include all the box plots, all the outliers, and the frame 
around the graph. Scattergram would include all the points as well as 
the frame and any regression lines or confidence bands. 


The next layer higher is the drawings layer. The elements of this 
layer are the shapes and text you add with the drawing tools as well 
as any text created by StatView II. Within this layer, elements most 
recently drawn will cover earlier elements. For example, if you make 
a small gray box then a large black box in the same location, you 
will only see the black box. Note: You cannot put shapes behind the
statistics or background layers.


The highest layer is the legend. This means that you can always see 
your legend, regardless of what you draw in the drawings layer. This 
is useful since the symbols in the legend are also StatView II 
controls for choosing the shape and color of the points in your 
drawing. 
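

The layering rules can be pictured as a simple back-to-front drawing
order, sketched below. This is only a model of the behavior described
in this section, not StatView II's code; the names LAYER_ORDER and
draw_order are hypothetical.

```python
# Layers from back to front, as described in this section.
LAYER_ORDER = ["background", "statistics", "drawings", "legend"]


def draw_order(objects):
    """Return object names in the order they would be painted.

    `objects` is a list of (name, layer) tuples in creation order.
    Later entries in the result cover earlier ones.
    """
    # sorted() is stable, so objects within one layer keep their
    # creation order: the most recently drawn element ends up on top.
    ordered = sorted(objects, key=lambda obj: LAYER_ORDER.index(obj[1]))
    return [name for name, _ in ordered]


view = [
    ("legend", "legend"),
    ("gray box", "drawings"),
    ("chart", "statistics"),
    ("black box", "drawings"),
    ("color", "background"),
]
print(draw_order(view))
```

In this example the black box is painted after the gray box, so it
covers it, and the legend is painted last, so it is always visible.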


Resizing 


The view window can be easily resized. By dragging the size box in 
the lower right corner of the window, you can shrink or enlarge the 
window. Clicking on the zoom box will make the graph the size of 
the full screen. When the window is the full size, clicking the zoom
box again will bring it down to the size it was before you zoomed it
up.


When you resize the view window, StatView II adjusts the relative 
positions of objects to fit inside the new window size. Any object 
you have drawn or resized is kept the same size in the new view 
window. Any object for which you have not specified a size will be 
adjusted for the new window size. 


To resize the chart, click the mouse inside the chart's frame; a dotted 
outline of the frame will appear. 


[Figure: Scattergram for columns: X1Y1]
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Dragging the grow box in the lower right corner of the chart frame 
changes the size of the frame. 
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Preferences 


The Preferences command in the Graph menu lets you specify
many aspects of the chart's general appearance. These settings are saved
when you leave StatView II and are used each time you run the 
program. 

Note: These settings are saved in the StatView Library file; you 
should be sure that this file is in the same folder as StatView II so 
that your preferred settings are used each time you run the program. 


The command's dialog box looks like this: 


Set Graph Preferences

Distinguish variables by:
   ( ) point type   ( ) color   ( ) both

Select order in which to use point types:
   first  [point symbols]  last

Select order in which to use colors:
   first  [colors]  last

Select default text attributes:
   Font   Size

   [ OK ]   [ Cancel ]





For most charts, the most important selection in the dialog is the 
Distinguish Variables By choice. Generally, you will want to: 


* choose Point Type if you are displaying your results on a
non-color system or using a monochrome monitor


* choose Color if you are using a color monitor or a
monochrome monitor with gray scales


Choosing Point Type causes StatView II to use different point 
shapes for each variable; choosing Color causes StatView II to use 
the same point shape but different colors. Choosing both lets 
StatView II change both point type and color for each variable. 


If you are on a non-color system but wish to differentiate your 
variables by color, they will still appear on your screen as black and 
white. When you print your chart on a color output device, the 
objects will appear in color. To find out the color assigned to a point, 
select the point in the legend and choose the Color command in the 
Graph menu. Color is described in greater detail in the next section. 


One of the next two selections is available, depending on whether 
you chose point type or color. To change the order of points, click
and hold down the first point you want to change. For instance, if
you selected Point Type and want the third point to be a solid circle, 
click on the current third element (the hollow triangle). The 
following menu appears under the pointer: 





While still holding the mouse button down, move the pointer to the 
solid circle and release the button. StatView II rearranges the other 
point types to match your request. 


If you are using color, the actions are the same. When you select the
color you want to change, the color menu appears. Drag to the color 
you want and release the mouse button. 


The default setting is to use both point type and color. If you
will always be using a non-color system, you should choose point
type only.


Remember that, if you choose to distinguish variables by color only,
StatView II uses the first point in the point order list as the plotting
symbol.


You can also use this dialog box to set the default font and text size 
of text drawn in the graph view windows. To change either the font 
or the size, click on the box, and a menu appears under the pointer. 
Drag to the font or size you want and release the mouse button. If 
you are printing on a LaserWriter, it is likely that you will want to 
choose a LaserWriter font such as Times or Helvetica. 


Colors 


StatView II operates on systems that support color as well as those
that do not. Non-color systems include black-and-white Macintoshes
(SE and Plus) as well as Macintosh IIs with monitors set to less than
16 colors or grays. Color systems include all Macintosh IIs (or future
machines) with monitors set to 16 or more colors or grays.


If your system is using a monochrome monitor, StatView II supports
different gray scales. If you are using a non-color system, StatView
II supports the eight old QuickDraw colors: black, red, green,
yellow, blue, magenta, cyan, and white. However, if you are using a
color monitor, it is likely that you want to use the Macintosh II's
color capabilities to their fullest. Even if you don't have color or are
running on a non-color Macintosh, you can still take advantage of
the eight old QuickDraw colors (and thereby use color output
devices).


StatView II takes advantage of color in many ways. It can use either
8, 16, or 32 colors. StatView II can use any of the 16 million colors
in each palette slot. If your monitor can display 16 colors, the
StatView II color palette has 16 colors; if your monitor can display
256 colors, the StatView II color palette has 32 colors. If you use a
monochrome monitor, you can have 16 or 32 shades of gray.


To tell StatView II which colors you want in the palette, choose the
Edit Palette command in the Graph menu. Click the color you want
to modify, and Apple's Color Picker dialog box appears.


To change the color in the Color Picker, drag the pointer around the
color wheel. The top portion of the color box on the left shows the
new color. The bottom part shows the original color. You can also
change colors by clicking on the numerical values associated with
the colors. To add black or white to a color, drag the scroll bar.


When you are finished choosing colors for your palette, click OK to
save the new colors or Cancel to leave the previous selection alone.
The Default button returns the palette to StatView II's default
palette.


The colors you choose are saved in the StatView Library file. You
get the same colors each time you start StatView II, even if you
changed your palette when running other programs.


You can change the color of almost any object in a graph. Simply
select the object, then choose the Color command in the Graph
menu.


Drawing with StatView II


So far, you have seen how to create charts and make statistics graphs 
with StatView II. This section shows you how to use the drawing 
tools to embellish your charts with text, drawings, and other art. 


Selecting 


The first tool in the drawing tools menu is the selection tool.


You use this tool to select parts of your drawing when you want to
modify or change it. Clicking on the selection tool changes the 
pointer to an arrow. 


When you select an object, a frame is put around it. For instance, if 
you select a line of text (such as the title), it becomes framed: 


[Figure: the chart title enclosed in a selection frame]


To select more than one object: 

*> Select the first object. 

*> Hold down the Shift key. 

*> Select the next object. 

You can also select multiple objects by dragging a selection 
rectangle around them. When you have specified the selection tool 
and begin a selection somewhere other than on an object, the pointer 
becomes a finger pointing. Drag that to enclose the objects you want
to select. When you let go of the button, all the enclosed objects will
be selected.


When you give commands and one or more objects have been 
selected, the command affects the selected object(s). 


Selected objects can be moved by clicking inside the frame and 
dragging the frame to a new location. When you move a chart, the 
axes move as well (although the labels do not). 

If two objects in the drawing layer overlap, you can move the front
one behind the back one with the move to back and move to front
tools. The move to back tool is:


The move to front tool is:


For example, if you draw a box then add some text, you may want to
put the text behind the box. Select the text and click the move to
back tool.


Resize objects by selecting them and dragging their controls or the 
grow box in the lower right corner. You can modify the features of a 
selected item with the various commands in the Graph menu. 


To select all the objects in a graph window, use the Select All
command in the Edit menu. The Select Background command
selects just the background so that you can change the color.


Cut, Copy, and Paste 


StatView II uses the Macintosh standard Cut, Copy, Paste, and
Clear commands in the Edit menu. These act on the object or 
objects that are selected when you give the commands. 


The view window of StatView II presents a chart as a group of 
distinct objects such as text, lines, axes, and the legend; any of these 
objects may be individually copied out and pasted back to the view 
window. Also, many of these objects can be cut and cleared from the 
view window with the Cut command. The objects that may not be 
cut or cleared from the view window are: 


* the legend

* axes

* the content area of the graph (such as the plotted points of a
scattergram)


If necessary, you can remove the legend by giving the Hide Legend 
command in the Graph menu. To remove an axis or the axis lines of 
the content area, select the item and make its color the same color as 
the background. This effectively hides the item. 


When any graph objects are copied in the view window, StatView II
copies a picture that looks exactly like the object, making it static.
As a result, if you copy an axis, paste it back in with the Paste
command, and then change the original axis' bounds with the Open
Axis command, the pasted axis will not match either the graph or the
original axis.


You can paste pictures copied from either StatView II or any other 
program into a view window. You can then resize the picture or 
move it around as you wish. Pasting into the View window will not 
replace a selected object. The newly pasted object will appear on the 
top of the graph in the drawing layer. 


You have full cut, copy, and paste control over text items and items
you draw in the view window. When you cut or copy text with
StatView II, any style information you may have specified is
remembered along with the text. Some applications ignore the style
information when text from StatView II is pasted into them. This is
often true for the subscript and superscript styles.


Note: If you want to copy the whole graph (including all the drawing 
items you have added) to the Clipboard, use the Copy View 
command in the Edit menu. This saves the chart in Macintosh PICT 
format; this allows other object-based programs like MacDraw to 
manipulate the parts of the view separately. Remember that if you use
the Copy command, you will copy out only the selected objects.


If you are displaying a table view, you are not allowed to cut, paste, 
or clear items to and from the table view. You are, however, allowed 
to copy the entire table with the Copy View command. If you want 
to copy numeric values or table titles of an analysis result, you can
select one or more values from the table and copy those numbers out
using the Copy command. This is particularly useful for pasting
numeric results into spreadsheets, word processors, and data base
programs.


Rulers and Grids


StatView II's ruler and grid allow you to precisely place any text
and drawings you have added to your chart.


The Show Ruler command in the Graph menu turns on the rulers 
around your drawing: 


[Figure: chart with rulers shown along the top and left edges]





As you move the pointer around in the chart, the position of the 
pointer is indicated on the ruler by the moving lines. 


For example, if you want to line up the tops of two objects: 
*> Select the first object. 


*> Move the pointer to the top of the object and note the position of 
the indicator line in the ruler in the left margin. 


*> Select the second object. Move the cursor to the line at the top of
the object and drag it to the position you want.


*> Note the position of the indicator line on the ruler. When it is the
same as the first object, the tops are lined up.


You can change the zero point of your ruler, if you wish. Click in 
either ruler or in the intersection of the rulers and drag the gray line 
horizontally or vertically to where you want the zero point to be. 
You can reset the zero point by clicking in the same rectangle. 


You can change the markings on the ruler with the Custom Rulers 
command in the Graph menu. This command's dialog box is: 


Choose Ruler's Units:
   ( ) Inches   ( ) Centimeters

Choose number of divisions per Unit:
   ( ) 1   ( ) 2   ( ) 4   ( ) 5   ( ) 8   ( ) 10





Select the number of gradations you want and whether you want the 
rulers to be in centimeters or inches. 


The Turn Grid On command in the Graph menu constrains your 
movement when you draw or move objects. This forces you to draw 
objects at intervals of your ruler gradation. It assures that if you 
place something while the grid is on, it will line up with other 
objects that are also placed when the grid is on. To stop this action, 
use the Turn Grid Off command. 


If you have objects that you have already drawn and later decide you 
want them to be aligned with the grid, select them and give the Align 
to Grid command in the Graph menu. This can only be done after 
you have given the Turn Grid On command. 
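To picture what the grid does, the sketch below snaps a coordinate to the nearest ruler gradation. It is an illustration in Python only: the function names are hypothetical, and the figure of 72 screen points per inch is an assumption about the display, not something stated in this manual.

```python
import math

def grid_step(units="inches", divisions=8, points_per_inch=72):
    """Spacing between grid lines, in screen points (assumed 72 per inch)."""
    per_unit = points_per_inch if units == "inches" else points_per_inch / 2.54
    return per_unit / divisions

def snap(value, step):
    """Move a coordinate to the nearest multiple of the grid step."""
    return math.floor(value / step + 0.5) * step

step = grid_step("inches", 8)   # eighth-inch gradations
left_edge = snap(13, step)      # a little past the first grid line
right_edge = snap(14, step)     # closer to the second grid line
```

Two objects placed or aligned with the same step end up on the same set of lines, which is why objects placed with the grid on line up with one another.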


It is likely that you will want to add text to some of your charts. This 
text might be something as simple as a short label, or might be a long 
description of the statistics shown in the chart. 


To add text to a chart, select the text tool: 


The pointer becomes an I-beam, similar to the one you see in word 
processing programs. When you click in the chart, StatView II draws 
a box to indicate where the text will go.


Type in whatever text you want. If your text is more than one line, it
will justify itself within the size of the enclosing rectangle.


You can easily change the size of the text box. Click on the selection
tool (the arrow at the top of the drawing tools), then click and drag
the small square in the lower right corner of the text box. This allows
you to make the text box as large or small as you want. You can also
resize the rectangle with the text tool: when you move over the size
box, it turns to the selection tool so that you can drag the size box.


Changing the look of the text is also easy. In the Text menu, use the
pull-across menus for the Font, Size, and Style commands. You can
change the text as a whole or select only part of the text before you
change the attributes. Simply select the characters you wish to
modify and choose commands in the Text and Graph menus.


You can also change how the letters are aligned within their text box
with the Left Justify, Center Justify, and Right Justify commands.
You can display your text rotated with the Rotate Left and Rotate
Right commands. Note that if you have rotated your text, you must
return it to horizontal orientation before editing it. You can change
the color of all the text by selecting the text box, or the color of a
character or a group of characters by selecting those characters, and
then choosing the color with the Color command in the Graph menu.


Note that on non-color systems rotated text can only be a single
color for all characters.


Although this discussion has been about text you add yourself, it also
applies to text created by StatView II. You will notice that StatView
II provides a default title for each graph which gives information
about which variables are in the chart and the type of chart. You will
probably want to customize this title to fit your own needs. You can
also change the text in the legend and in the axis labels.


Drawing Objects


Like the text tool, you will also find the other drawing tools useful 
for adding information to graphs to make them more understandable 
or more attractive. The tools are for lines, rectangles, rounded
rectangles, and ellipses.


For example, to add a box around a group of points: 


*> Select the rectangle tool. The pointer becomes a cross-hair.


*> Move the pointer to where you want one corner of the box.


*> Click and drag the pointer to where you want the opposite corner.


To draw squares and circles, use the rectangle and ellipse tools. If
you hold down the Shift key before you click for the first corner,
StatView II will restrict your drawing movements to make the
object a square or circle.


When you add an object, StatView II shows you its controls. The
controls are the small squares around the object:


[Figure: a rectangle with controls at its corners and sides]


You can resize the object by dragging one of its controls. For 
instance: 


*> Select the selection tool. 


*> Click and drag the lower left control away from the rectangle. 
When you release the corner, the rectangle grows. 


To make an object grow or shrink in both directions by a 
proportional amount, select the object, hold down the Shift key, then 
drag one of the corners. StatView II constrains the growing and 
shrinking. 
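The effect of the Shift key can be sketched as two small calculations. This is a Python illustration under stated assumptions, not StatView II's own logic: the function names are hypothetical, and using the shorter of the two drag distances as the side of the square is a guess.

```python
def _sign(v):
    """Return 1 for zero or positive values and -1 for negative values."""
    return 1 if v >= 0 else -1

def constrain_square(x0, y0, x1, y1):
    """Adjust the dragged corner (x1, y1) so the shape is a square."""
    dx, dy = x1 - x0, y1 - y0
    side = min(abs(dx), abs(dy))          # assumed: the shorter drag wins
    return x0 + _sign(dx) * side, y0 + _sign(dy) * side

def constrain_proportional(width, height, new_width):
    """Resize to new_width while keeping the width-to-height ratio."""
    return new_width, height * new_width / width

corner = constrain_square(10, 10, 50, 30)       # drag 40 across, 20 down
resized = constrain_proportional(40, 20, 60)    # widen a 40 x 20 box to 60
```

The first call yields a 20 by 20 square; the second keeps the 2 to 1 shape of the original box.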


If you want to only stretch or shrink the object in one direction, 
choose a control on the side of the object instead of in the corner. 
These controls only let you move in one direction. 


To move the object, select the middle of the object and drag it 
around the chart. 


The objects you draw have drawing attributes that you can change
with the commands in the Graph menu. These are:


Color         The color for the object.

Pen           The pattern of the line in the object.

Fill          The pattern of the interior of the object. Choosing
              “None” makes the object transparent. If you are
              using color and want a solid color, choose black.

Line Width    The thickness of the lines used in the object.

Arrow Heads   You can add arrow heads to lines.

Point Type    Changes the selected point to a different shape.
              The point must be selected in the legend.

Point Size    Changes the selected point to a different size.


Feel free to experiment with these drawing attributes. Although the
standard choices may suit you well, it is likely that you will find use
for many of the attributes as you become more proficient with
StatView II.


Changing the Axes



The axes are often very important in the representations of charts. It
is important that they convey the correct range at the correct
intervals to show the meaning of the data. StatView II gives you a
great deal of flexibility in the way that the axes are displayed.


You cannot edit the text in the axes. That is, you cannot select some
of the text and change it. However, you can select the axes and apply
all of the formatting and styles you have seen above.


To change an axis, select it by clicking in one of the values with the
selection tool. StatView II draws a rectangle around the axis:


[Figure: an axis enclosed in a selection rectangle]


Give the Open Axis command in the Graph menu. (If you give the 
Open Axis command without selecting an axis, it will let you 
modify both axes.) The dialog is: 


Horizontal Axis Information

Bounds:
   From: [     ]   To: [     ]   [ ] Lock bounds
   Length: [2.542] (in.)

Scale:   ( ) linear   ( ) log   ( ) normal

Tickmarks:
   Lie:   ( ) outside   ( ) inside   ( ) both   ( ) neither
   Per Major Interval:   ( ) auto   ( ) none   ...

Grid Lines at:   ( ) no lines   ( ) zero   ( ) major ticks   ( ) minor ticks





Bounds       You can set the lower and upper numeric bounds
             for the axis. When StatView II prepares the graph,
             it chooses the bounds so that the entire range of
             points will appear. You can, however, stretch or
             shrink these bounds by entering values into the
             From and To text boxes.

             If you specify new bounds, the check box beside
             Lock bounds will automatically be checked,
             signifying that these values are locked as the
             bounds for your graph. The bounds will stay in
             effect for both the paging and composite view of
             this graph no matter what values are added to or
             removed from the graph. If Lock bounds is
             checked and you add a value outside the bounds,
             it will not appear on the chart.

             If you wish to unlock your bounds, simply uncheck
             the Lock bounds check box, and
             StatView II will once again automatically set the
             From and To boundaries.
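The difference between automatic and locked bounds can be sketched as follows. This Python fragment is a simplified illustration with hypothetical function names: it takes the automatic bounds to be exactly the smallest and largest data values, whereas StatView II may round them to convenient numbers.

```python
def axis_bounds(values, locked=None):
    """Return (low, high) for the axis: locked bounds win over the data range."""
    if locked is not None:
        return locked
    return min(values), max(values)

def visible(values, bounds):
    """Keep only the values that fall inside the axis bounds."""
    low, high = bounds
    return [v for v in values if low <= v <= high]

data = [115, 190, 260]
auto = axis_bounds(data)                       # follows the data
locked = axis_bounds(data + [300], (100, 200)) # stays where you set it
shown = visible(data + [300], locked)          # values outside are not drawn
```

With the bounds locked at 100 to 200, the added value of 300 and the existing value of 260 fall outside the axis and do not appear.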


Length       You can also specify how long the axis should be.
             There are two methods for changing the length of
             an axis. You can change the number in this dialog
             box, or, when viewing the graph, select the
             outline of the graph with the selection tool and
             stretch it by dragging a control.

Scale        The axis scale can be linear, logarithmic, or
             normal.

Tick Marks   The ticks can be shown outside or inside the axis,
             or both. If you want no tick marks, choose
             Neither. You can specify how many tick marks
             appear for each major interval.

Grid Lines   StatView II normally only puts a grid line at 0. If
             you want to add other grid lines, specify which
             you want here.


Note that, when you resize a graph, the axes are automatically 
resized with it. 


The Legend


The topmost layer in a graph is the legend. Like the other parts of
your graph, the legend is easy to modify. When StatView II creates
the legend, it assigns it a default width. If the legend text does not fit,
the text is followed by ellipses.


[Figure: legend listing caucasian, black, hispanic, and asian]


To change the width of the legend, select it with the selection tool
and drag on the bottom corner.


[Figure: narrowed legend with the entries shortened to "cauc...", "black", "hisp...", and "asian"]


When the legend is selected, you can change the color, font, size,
and style of the text items with the various commands in the Graph
and Text menus. Note that, although you can edit legend text
individually, you cannot change the attributes (font, style, size,
color) of each item in the legend, only of the legend as a whole.


You can switch the orientation of the legend with the Horizontal
Legend and Vertical Legend commands in the Graph menu.
Making a horizontal legend is convenient if you want to run the
legend across the top or bottom of the graph:


[Figure: horizontal legend listing caucasian, black, hispanic, and asian]


If you don't want the legend to show at all, use the Hide Legend
command in the Graph menu. To bring back a hidden legend,
choose Show Legend.


The legend is the control area which is used to change the points,
fills, and patterns in a graph after it is drawn. To change the shape or
color of a set of points, the fill or color of a bar chart or pie slice, or
the pen or color of a line, select that control in the legend.


[Figure: legend with one control selected]


Then give the appropriate command in the Graph menu.


The box plot, error bar, and comparative bar charts do not have
legends; instead, each variable is labelled individually on the X axis.
To change the attribute of a box or bar, select it in the graph, then
give the desired command in the Graph menu.


When Customizations Disappear


If you change a chart, your customizations can be lost if you change 
to a different view or choose a new statistic. The tables below 
describe what actions affect your graph customizations. 


The following actions WILL cause your graphic customizations to
be lost:


* changing from one statistic to another

* changing from one view type to a different view type (except
for those noted below)

* removing a variable which appears on a paging view


The following actions WILL NOT cause your graphic
customizations to be lost:


* causing a statistic recomputation or graph redrawing by
including/excluding rows, adding or editing a range
restriction, changing a value in a data column

* switching to and from a table view

* changing between scattergram view and line chart view

* toggling between paging and composite views

* changing a statistic parameter (such as from simple to
polynomial regression)

* removing a variable which appears on a composite view


StatView II updates the default graph titles and axis labels
automatically when you change any variables in your chart unless
you have changed the text in the title or label. The only exception to
this occurs when the chart title is an equation for a regression. These
equations will always be updated if the regression is recomputed.


To avoid losing your customizations, be sure you have fully
experimented with your data before you begin customizing your
chart.


Using StatView II's Drawing Features


This section gives you an extended example of how StatView II's 
drawing features are used. The first example customizes a 
scattergram.


*> Activate the Lipid Data window. 


*> Choose Quick Assignment in the Vars menu. Select Weight as 
X and Cholesterol and Chol-3yrs as Y. 


*> Choose Scattergram in the View menu and zoom the window to
full size.


[Figure: scattergram of the assigned X and Y columns]


*> Select the legend and drag it until it is in the upper left hand corner
of the chart frame.


*> Select the chart frame and drag the grow box until the frame
rests near the right edge of the window.


[Figure: scattergram with the legend in the upper left corner and the frame enlarged]


*> Draw a box around the legend.


[Figure: legend (Cholesterol, Trig-3yrs) enclosed in a box]


*> Select the chart frame and move the bottom of the chart up. Move
the X and Y axis labels so they are placed evenly with the new
frame size.


[Figure: scattergram of Cholesterol against Weight with the resized frame and boxed legend]


*> Select the text tool. Click under the bottom of the chart.


[Figure: empty text box below the chart]


*> Type “Illustration 1: Lipid measurements for medical students”
and resize the text box to show all the text.


*> Select “Illustration 1” and from the Text menu change its font size
to 10 pt and its font style to bold.


[Figure: caption reading "Illustration 1: Lipid measurements for medical students"]


*> Select the line tool.


*> Click the line tool near the top extreme Trig-3yrs value and drag
to the right.


*> Choose Arrow Heads in the Graph menu. Drag to the single
head on the left.


*> Select the text tool. Click under the bottom of the arrow.


*> Type “Outlier”.


[Figure: arrow labeled "outlier" pointing at the extreme value]


The next example customizes a box plot.


*> Choose Quick Assignment in the Vars menu. Clear any X or Y
variables. Select Cholesterol and Chol-3yrs as X variables.


*> Choose Percentiles in the Describe menu.


*> Choose Box Plot in the View menu and zoom the window to full
size.


[Figure: box plots for the columns Cholesterol and Chol-3yrs; vertical axis "Units", horizontal axis "Columns"]


*> Select the Cholesterol box plot by clicking in its middle. A
frame will appear around the box.


[Figure: the Cholesterol box plot enclosed in a selection frame]


*> Choose Fill Pattern in the Graph menu. Drag to the medium
gray and release the pointer. This fills the selected box plot with
the chosen gray pattern.


*> Repeat this procedure for the Chol-3yrs box. This time fill the
box plot with the diagonal black line fill pattern.


*> Select the vertical axis label "Units".


*> Choose Rotate Right in the Text menu. You will now be able to
edit this text.


*> Select the text tool. Double click in the text field to highlight the
entire word.


*> Type “Cholesterol (mg/dl)” and resize the text box until all the
text is visible.


*> Choose Rotate Left in the Text menu.


[Figure: vertical axis with the rotated label "Cholesterol (mg/dl)"]


*> Choose Show Rulers in the Graph menu.


*> Click in the top left corner of the ruler and drag down the gray
line until it lines up with the top of the chart. The zero point is
now the top of the Y axis.


[Figure: ruler with its zero point at the top of the Y axis]


*> Select the vertical axis label and drag it until it is centered.


Explore the other drawing features available on the Text and Graph
menus. If you have a color system, experiment with assigning colors
to various objects on the screen.


Overview


StatView II computes descriptive statistics for any dataset column 
assigned as an X variable. The descriptive statistics calculated are: 


* mean
* standard deviation
* standard error of the mean
* variance
* coefficient of variation
* minimum
* maximum
* range
* sum
* sum of squares
* number of non-missing values
* number of missing values
* t or normal distribution confidence intervals
* 10th percentile
* 25th percentile
* 50th percentile
* 75th percentile
* 90th percentile
* number of values below 10th percentile
* number of values above 90th percentile
* mode
* geometric mean
* harmonic mean
* coefficient of kurtosis
* coefficient of skewness
* frequency distribution


The descriptive statistics fall into two categories: those above the
gray line on the Describe menu and the one (Frequency
Distribution) below the gray line. Any combination of the
descriptive statistics above the gray line can be chosen.


The statistics chosen are checked on the menu. Deselect a statistic by
choosing it again. When all statistics are deselected, None is
checked. Choosing None will deselect all Describe statistics.


If you choose Frequency Distribution, any checked statistics above
the gray line become unchecked. Remove Frequency Distribution
by choosing None or any statistic above the line.


The following discussion refers to information from the
Learning StatView II chapter. You should be familiar with
the concepts discussed in the following sections of that
chapter: Assigning Variables, Graphing Data, Selecting
a Statistic, and Viewing Results.


Numeric Descriptive Statistics For A Single Variable


This example deals with the nongraphic descriptive statistics that
you might wish to compute for a single variable (that is, a single
StatView II data column). The dataset we are examining is again the
Lipid Data included on the StatView II diskette.


*> Open or activate Lipid Data. 


*> Assign an X to the column associated with the variable that you 
are interested in describing. For our dataset, we are interested in 
describing the variable Cholesterol. Double click the cursor on 
the variable name Cholesterol to assign it as X1.


*> Choose Mean, Std. Dev., etc.

*> Choose Mode.

*> Choose Percentiles.

*> Choose Geometric Mean.

*> Choose Harmonic Mean.

*> Choose Kurtosis & Skewness.


*> Choose Table from the View menu. (Some views on the menu
are grayed out. They are not available for the selected statistic.)


As you become experienced with StatView II you may not wish to 
make all these selections. Indeed, many people will find the first 
three selections sufficient. 


The View window appears, and the results of the checked statistics
for the Cholesterol X1 variable are displayed in tabular form:


X1: Cholesterol

Mean:        Std. Dev.:   Std. Error:   Variance:    Coef. Var.:    Count:
191.232

Minimum:     Maximum:     Range:        Sum:         Sum Squared:   # Missing:
                                        18167

# < 10th %:  10th %:      25th %:       50th %:      75th %:        90th %:

# > 90th %:  Mode:        Geo. Mean:    Har. Mean:   Kurtosis:      Skewness:


The first two rows of this table represent the statistics associated with 
the Mean, Std. Dev., etc. selection. The third row and the first entry 
of the fourth row represent the percentiles selection. Each of the last 
five entries of the final row represent the other descriptive statistics 
selections and are labeled accordingly. By following the statistics as 
reported here, you can generate many descriptive statistics. 


Measures of Central Tendency and Skewness 


If the distribution of Cholesterol values is symmetric, but not 
necessarily normal, then the 50th percentile (also referred to as the 
median), mean, and mode will be identical. For these data, the mode 
is 190. By definition, the mode is the most frequently occurring 
value in a distribution. The mode is the least valuable measure of 
central tendency, only telling us the most frequently occurring 
observed value. If there is more than one most frequently occurring 
observed value, there is no mode. 


The mean and median are better measures of central tendency. The 
median is 191 for our Cholesterol variable and is approximately .232 
units below the mean of 191.232, suggesting that the distribution is 
slightly skewed. Because the median is less than the mean, more than 
50 percent of the values are below the mean value of 191.232, 
suggesting a positively skewed distribution. Thus, the right tail of 
this distribution is longer than the left tail. This suggests that the high 
Cholesterol values tend to deviate more from the mean than the low 
Cholesterol values. 


A similar conclusion could have been reached by referring to the 
index of skewness, .302. This index is the average of the cubed 
standard scores (or z-values) of the distribution. If the average of the 
cubed standard values is 0, the distribution is symmetric, suggesting 
that the extreme values are evenly distributed above and below the 
mean. If it is negative, the distribution is negatively skewed, 
suggesting that the majority of extreme observed values are less than 
the mean. If it is positive, the distribution is positively skewed, 
suggesting that the majority of extreme values are greater than the 
mean. 
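
The definition above can be sketched in a few lines of Python. This is only an illustration: the function name and the two small samples are made up, and the population form of the standard deviation is assumed when forming the standard scores, which may differ slightly from what StatView II does internally.

```python
from statistics import fmean, pstdev


def skewness_index(values):
    """Average of the cubed standard scores (z-values) of a sample."""
    mean = fmean(values)
    # Assumption: standard scores are formed with the population standard deviation.
    sd = pstdev(values)
    return fmean([((v - mean) / sd) ** 3 for v in values])


# Made-up samples, not the Lipid Data.
symmetric = [170, 180, 190, 200, 210]
right_tailed = [170, 180, 190, 200, 285]

print(skewness_index(symmetric))     # zero for a symmetric sample
print(skewness_index(right_tailed))  # positive: the extreme value lies above the mean
```

Reversing the sign of every value in the second sample reverses the sign of the index, which matches the description of negative skew.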


Select the minimum value in the distribution of Cholesterol, 115, and 
change it to -1000. Notice when you view the table again that the 
median has not changed, while the mean has been reduced 
substantially. This demonstrates the stability of the median and the 
instability of the mean as measures of central tendency. The median 
is the single best descriptive measure of the central tendency of a 
distribution. 
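
The same experiment can be tried outside StatView II. The sketch below uses six made-up values (only the minimum of 115 and the maximum of 285 come from the Cholesterol example) and replaces the minimum with -1000.

```python
from statistics import mean, median

# Made-up values; only 115 and 285 echo the Cholesterol example.
values = [115, 168, 190, 191, 212, 285]

# Replace the minimum with an extreme value, as in the exercise above.
altered = [-1000 if v == 115 else v for v in values]

print(median(values), median(altered))  # the median is unchanged
print(mean(values), mean(altered))      # the mean drops substantially
```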


Variance and Standard Deviation 


The variance is a measure of the dispersion or the extent to which 
there are individual differences in the distribution of values. For the 
Cholesterol data, the variance is 1272.648, indicating that, on the 
average, the squared difference between any value and the mean is 
1272.648. StatView II uses the variance formula for an unbiased 
sample estimate rather than the formula for a population. As a 
consequence, when comparing a StatView II variance estimate to a 
variance estimate obtained by some other procedure, you might 
occasionally find that the StatView II estimate is a small amount 
larger if the alternative method has used a formula for a population 
variance. The differences between the two formulae are small and 
only worth noting so you can be assured that any comparative 
discrepancies that you might observe are due to a subtle difference 
between formulae. 
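
The difference between the two formulae can be seen with Python's standard library, where `variance` divides by n - 1 (the unbiased sample estimate) and `pvariance` divides by n (the population formula). The five values are made up for illustration.

```python
from statistics import pvariance, variance

# Made-up sample, not the Lipid Data.
sample = [150, 170, 190, 210, 230]
n = len(sample)

sample_var = variance(sample)       # sum of squared deviations / (n - 1)
population_var = pvariance(sample)  # sum of squared deviations / n

print(sample_var, population_var)

# The two estimates differ only by the factor (n - 1) / n.
print(sample_var * (n - 1) / n)
```

With 95 observations the factor is 94/95, so the two estimates differ by about one percent.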


The standard deviation is simply the square root of the variance. For 
Cholesterol, it is 35.674. Depending upon discipline and preferences, 
you may want to report either a variance or a standard deviation, but 
usually not both. 


Maximum, Minimum and Range 


The maximum, minimum, and range are especially valuable in that 
they help you check the data quickly. The minimum is just the 
smallest value in the dataset, 115 for our Cholesterol variable, while 
the maximum is the largest value in the dataset, 285. You generally 
have an idea of what the span of values should be in a given 
distribution. Should either the minimum or maximum value be 
substantially lower or higher than expected, then you have probably 
entered a value incorrectly into the dataset. 


While it is very difficult to directly interpret a standard deviation as 
being either large or small, it is possible to compare the range to the 
standard deviation. This comparison will provide some sense of 
whether a distribution is homogeneous (small variance), or 
heterogeneous (large variance). The ratio of the range to the standard 
deviation should typically define some value between 2 and 6. For 
our Cholesterol data, with a range of 170 and a standard deviation of 
35.674, this ratio is approximately 4.77. Because 4.77 is near the 
center of the traditional range of 2 to 6, our data are neither 
heterogeneous nor homogeneous. Had our data defined a ratio near 
2, or even less than 2, we would have concluded that our Cholesterol 
sample was extremely homogeneous. By similar logic, had our data 
defined a ratio near 6, or even greater than 6, we would have 
concluded that our sample was extremely heterogeneous. 
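
As a quick check of the arithmetic, the ratio can be recomputed from the summary values reported in the table. The function name is hypothetical.

```python
def range_to_sd_ratio(minimum, maximum, sd):
    """Ratio of the range (maximum - minimum) to the standard deviation."""
    return (maximum - minimum) / sd


# Summary values for Cholesterol as reported in the text.
ratio = range_to_sd_ratio(115, 285, 35.674)
print(round(ratio, 3))
```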


Kurtosis 


When describing a distribution, four measures are typically 
provided: the mean, variance, skewness, and kurtosis. We have 
briefly discussed the first three. Kurtosis refers to both the peak, the 
center of the distribution, and tails of a distribution. The kurtosis of a 
distribution is computed as the average fourth power of the standard 
scores minus 3. 


A distribution may be characterized as leptokurtic, extremely peaked 
with “slim” tails; platykurtic, extremely flat with fat tails; or 
mesokurtic, modestly peaked with modest tails. For a normal 
distribution the average fourth power of the standard scores is 
exactly 3. Since by convention 3 is subtracted from the average 
fourth power, the kurtosis of a normal or mesokurtic curve is 0. A 
leptokurtic distribution would be characterized by a positive index of 
kurtosis, a platykurtic distribution would be characterized by a 
negative index of kurtosis. The larger the absolute value of the index 
of kurtosis, the more extreme the kurtosis. 
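
The index described above can be sketched as follows. The function name and both samples are made up, and the population form of the standard deviation is assumed for the standard scores, which may not match StatView II exactly.

```python
from statistics import fmean, pstdev


def kurtosis_index(values):
    """Average fourth power of the standard scores, minus 3."""
    mean = fmean(values)
    # Assumption: standard scores are formed with the population standard deviation.
    sd = pstdev(values)
    return fmean([((v - mean) / sd) ** 4 for v in values]) - 3


# Made-up samples: all values at two extremes, and a sharp peak with two far values.
two_extremes = [0, 0, 1, 1]
peaked = [0] * 18 + [-5, 5]

print(kurtosis_index(two_extremes))  # negative index
print(kurtosis_index(peaked))        # positive index
```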


For our Cholesterol distribution the kurtosis is .036, indicating that 
the distribution is mesokurtic, neither peaked nor flat. If it were 
platykurtic, its fat tails would indicate that there are more extreme 
values in the distribution than you would find in a normal 
distribution, whereas the slim tails of a leptokurtic distribution would 
suggest that there are fewer extreme values in the distribution than 
you would find in a normal distribution. 


How fat can the tails get? It is possible to define a concave 
distribution, U shaped, where virtually all of the values are in the 
tails of the distribution. Such a concave, extremely platykurtic 
distribution could have a kurtosis index that approaches -2. By the 
same logic, you might wish to define a peaked distribution with very 
long, low tails. Such a distribution would be extremely leptokurtic 
with a kurtosis index that might be as large as 2. 


Coefficient of Variation 


One of the more intriguing topics in basic statistics deals with the 
scale of the variable being analyzed. An in-depth discussion of this 
topic is beyond the scope of this manual. However, it is necessary to 
briefly touch on it in order to appreciate the coefficient of variation, 
the harmonic mean and the geometric mean. Variables usually 
assume the characteristics of one of four scales: nominal, ordinal, 
interval or ratio. When Karl Pearson was refining much of what we 
call “descriptive statistics,” he was working with ratio variables. 
Such variables have no negative values, a zero that implies an 
absolute lack of whatever is being measured, and an equal distance 
between any two consecutive values. 


An example of such a ratio variable is limb length. It is well 
documented that the variation of the length of a front limb is related 
to the average length of the front limb. For instance the average 
length of the front limb of a human is substantially greater than the 
average length of the front limb of a hamster. We find that the 
standard deviation of the length of human arms is also substantially 
greater than the standard deviation of the length of the front legs of 
hamsters. You could not compare the standard deviations of the front 
limbs of the two species and conclude that there is greater front limb 
variation in humans than there is in hamsters. 


In situations where two populations differ appreciably in their 
means, assuming a ratio variable, the coefficient of variation should 
be used to compare variation. It is computed as the ratio of the 
standard deviation to the mean multiplied by 100. This coefficient is 
independent of the unit of measurement and will usually range from 
some value greater than 0 to 100. It is possible with some extremely 
heterogeneous data to have coefficients greater than 100. 
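
A short sketch may make the unit independence concrete. The function name and the limb lengths are made up; the last lines recompute the coefficient from the Cholesterol mean and standard deviation reported in the table.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean, multiplied by 100."""
    return 100 * stdev(values) / mean(values)


# Made-up limb lengths, first in centimeters and then converted to inches.
lengths_cm = [60.0, 65.0, 70.0, 75.0, 80.0]
lengths_in = [v / 2.54 for v in lengths_cm]

# The coefficient does not depend on the unit of measurement.
print(coefficient_of_variation(lengths_cm))
print(coefficient_of_variation(lengths_in))

# Recomputed from the Cholesterol summary values in the text.
cholesterol_cv = 100 * 35.674 / 191.232
print(round(cholesterol_cv, 3))
```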


Our Cholesterol variable is a ratio variable. An observed Cholesterol 
value of 0 means that there is absolutely no trace of cholesterol 
present in whatever is being sampled. In the human population 
Cholesterol values typically range from 118 to 300. The coefficient 
of variation for our Cholesterol is 18.655. 


A pressing question that should be addressed deals with the 
magnitude of the coefficient of variation. How big is big? Simpson, 
Roe and Lewontin (1960), a reference prior to the widespread use of 
computers, report that a majority of values in the biological sciences 
are within the range of 4 to 10, with 5 or 6 being a reasonable 
estimate of the average value. All values reported by Pearson (1898) 
in his original work were less than 5. In the social sciences informal 
studies of non-ratio variables have suggested that the coefficient of 
variation typically ranges between 15 and 30, with small samples of 
fewer than 30 observations tending to define larger coefficients than 
large samples. Clearly the biological sciences work with more 
homogeneous variables than the social sciences. 


A relatively small coefficient of variation, indicating extreme 
homogeneity, would suggest that the associated variable is not 
sufficiently sensitive to represent variation within the sample being 
measured. Alternatively, a relatively large coefficient of variation 
might suggest that an instrument is not measuring a single 
dimension. 


If you are using a variable for which an observed value of 0 implies 
something other than the absolute lack of whatever the associated 
variable is measuring, then you are not using a ratio variable and the 
coefficient of variation could be seriously distorted, perhaps 
corrupted. It is best to ignore the coefficient of variation unless you 
are working with ratio variables. Such variables are typically found 
in the biological and physical sciences, but seldom found in the 
social sciences. 


Percentiles 


The Percentiles selection does not report the observed values 
associated with all 99 percentile points that exist in any distribution. 
Instead, StatView II identifies the observed values associated with 
five critical percentiles: the 25th, 50th, 75th, 10th, and 90th. The first 
three of these are referred to as the first, second and third quartile 
points, respectively; the second quartile is also called the median. 
The 10th and 90th percentiles are usually the “rule of thumb” points 
that segregate the extreme portions of the distribution from the rest 
of the distribution. 


To the extent that the difference between the values associated with 
the 25th and the 50th percentiles equals the difference between the 
values associated with the 50th and 75th percentiles, the middle 50 
percent of the associated distribution will tend to be symmetric. In 
addition, if the difference between values associated with the 10th 
and 50th percentiles is also equivalent to the difference between the 
values associated with the 50th and 90th percentile points, then the 
middle 80 percent of the distribution will be symmetric. 


Standard Error of the Mean 


The Std. Error item in the Describe menu stands for the Standard 
Error of the Mean. If we were to assume that our data came from a 
population, which we do when we compute our standard deviation, 
we might wonder about the adequacy of the mean in terms of 
representing the population mean. If we were to assume that our 
sample of Cholesterol values was randomly selected from a 
population of Cholesterol values we could make certain statements 
about what to expect if we were to take subsequent samples of 95 
Cholesterol values from the population. If we were to compute the 
mean for each sample of 95 Cholesterol values we would eventually 
have a sampling distribution of means. 


The mean of this sampling distribution would be the population 
mean and the standard deviation of this sampling distribution would 
be equivalent to the population standard deviation, estimated to be 
35.674 from our data, divided by the square root of the sample size, 
or the count, used. The estimated standard deviation of the sampling 
distribution of means is approximately (35.674/9.75) or 3.66, which 
is also referred to as the standard error of the mean. We would 
expect that approximately 68% of all sample means would be within 
3.66 Cholesterol units of the population mean. Thus, for our 
Cholesterol data it would appear as though there would not be too 
much fluctuation in the mean from sample to sample. In a sense, the 
standard error of the mean is an indication of just how accurately the 
population mean can be portrayed by the sample mean. To get a 
better understanding of the use of the standard error of the mean as a 
descriptor, you should refer to a basic statistics book that discusses 
confidence intervals. 
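
The computation described above is short enough to verify directly. The function name is hypothetical; the standard deviation and count are the values reported for Cholesterol.

```python
from math import sqrt


def standard_error(sd, count):
    """Standard deviation divided by the square root of the sample size."""
    return sd / sqrt(count)


# Values reported for Cholesterol: standard deviation 35.674, count 95.
se = standard_error(35.674, 95)
print(round(se, 2))
```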


Geometric Mean 


The geometric mean is not a general descriptor of the values of a 
distribution. It is commonly used to average economic indices. It is 
most useful when dealing with a variable undergoing a constant rate 
of change. It represents a mean that would be defined by the data if 
they were transformed in a specific manner. Recall in the discussion 
of the coefficient of variation above that there are situations where 
the means and variances associated with a ratio variable tend to be 
systematically related. In such situations, we find that as the mean 
increases, so also does the variance. Frequently the variance can be 
made independent of the mean by representing each value of the 
distribution by its common logarithm. When you compute the mean 
of a logarithmic distribution and then transform that mean back to 
the metric of the original untransformed distribution, the result is 
referred to as a geometric mean. Note that if you have a value of 0 in 
the distribution, the geometric mean can not be computed. For our 
Cholesterol data, the geometric mean is 187.925. The geometric 
mean is always less than the arithmetic mean when all values of the 
distribution are positive and not all equal. 
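
The log-transform route described above can be sketched as follows. The function name and the three values are made up, and the check for non-positive values is an assumption about how one might guard the computation, not a description of StatView II.

```python
from math import log10
from statistics import fmean


def geometric_mean_via_log10(values):
    """Mean of the common logarithms, transformed back to the original metric."""
    # Assumption: reject zero or negative values, for which log10 is undefined.
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    return 10 ** fmean([log10(v) for v in values])


# Made-up values chosen so the logarithms are easy to follow: 0, 1 and 2.
values = [1, 10, 100]

print(geometric_mean_via_log10(values))  # back-transformed mean of the logs
print(fmean(values))                     # arithmetic mean, for comparison
```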


Harmonic Mean 


The harmonic mean, also derived from transformed data, is most 
often used to average rates and ratios. It is based on a reciprocal 
transformation, which replaces a value by its reciprocal (where the 
reciprocal is defined as 1 divided by the value). The reciprocal of the 
mean of a distribution of reciprocals is referred to as a harmonic 
mean. Like the geometric mean, it is not really descriptive of the 
observed distribution. Instead, it may be thought of as being descriptive 
of a transformed distribution. Note that if you have a value of 0 in 
the distribution, the harmonic mean can not be computed. For our 
Cholesterol data the harmonic mean is 184.582. When all values are 
positive and not all equal, the harmonic mean is always less than 
both the geometric mean and the arithmetic mean. 


Graphic Description of Data For A Single Variable 


When you use the graphics window to view your data, there are 
several options available with the Describe menu. 


*> Choose Mean, Std. Dev., etc. from the Describe menu. 


*> Choose Scattergram from the View menu. 


Standard Deviation Error Bars 


Note that this scattergram is a univariate rather than the traditional 
bivariate scattergram. A univariate scattergram has only an ordinate, 
as there are no values for the abscissa. Assuming that Cholesterol is 
still the X1 variable, a standard deviation error bar should appear: 


One Standard Deviation Error Bars for Column X1: Cholesterol 


This standard deviation error bar for Cholesterol has a dot at the 
mean, 191.232. A line representing one standard deviation unit, 
35.674, is extended above and below the mean. This line is intended 
to convey some sense of the degree of dispersion about the mean. 
Usually a majority of the observed values fall within a band that 
extends from one standard deviation unit above the mean to one 
standard deviation unit below the mean. 


Univariate Scattergrams 


*> Click the composite/paging tool, the top right hand control of the 
view window. 


The graph will be redrawn with all the observed values entered. 
Notice in the following chart that the standard deviation error band and 
mean are noted by lines and marked on the right ordinate. Also note 
that this graph represents each observed value with a circle. The 
vertical placement of the observed values is with regard to the 
ordinate, Cholesterol. The circles are placed horizontally in such a 
fashion that we can count the number of individuals with a particular 
observed value. For instance four individuals have Cholesterol 
counts of 191. A quick count shows that 67 of the 95 observed 
values, or 70%, are within plus or minus one standard deviation of 
the mean. If these data were normally distributed we would expect 
approximately 65 of the 95 observed values, or 68%, to be within 
plus or minus one standard deviation of the mean. 


Scattergram for column: X1 Cholesterol 


Confidence Intervals 


Suppose you wished to estimate the population mean from the 
sample mean. We always assume that there has been some sampling 
error with any sample mean. Thus, we can never be sure that the 
sample mean and the population mean are exactly the same. We can, 
however, estimate a band of values which we might confidently 
predict as spanning the population mean. Such a span is a confidence 
interval. You can also specify the probability or degree of confidence 
that you wish to associate with the confidence interval. Such a value 
represents the probability that the band will span the population 
mean with repeated applications. That is, we never know if a 
particular confidence interval spans the population mean, but we do 
know that if we select a probability of .95, then 95 percent of the 
time that we construct such a confidence interval it will span the 
population mean. 


*> Choose Confidence Intervals from the Describe menu, and the 
Confidence Intervals dialog box appears: 


Select Distribution:   t   normal   Std. Dev.: 


Select Confidence Intervals: 


Radio buttons allow selection from two distributions available for 
computing confidence intervals: the t-distribution and the normal 
distribution. Choose the normal distribution only if you are 
constructing confidence intervals for data in which you know the 
true standard deviation of the population. If you choose normal 
distribution, enter the population's standard deviation in the text 
entry rectangle to the right of its radio button. When you don't know 
the population standard deviation, the sample standard deviation is 
used in conjunction with the family of t-distributions to construct a 
confidence interval. Confidence intervals based on sample standard 
deviations are always wider than confidence intervals based on 
population standard deviations. 


Regardless of whether you have selected a t- or normal distribution, 
you may have up to three probability levels associated with your 
confidence interval. Two of the probability levels have been 
preselected; they are 95% and 90%. The text entry rectangle next to the 
last check box allows you to enter your own choice for the 
probability to be used for the third interval. If you wish to display an 
interval based upon a single standard error bar, enter 68%. 


*> Click t for the distribution. 


*> Choose the 95% box, the 90% box, and enter 68 in the 
confidence interval text entry rectangle. (Entering values into this 
rectangle automatically checks the associated box.) 


*> Click OK. 


*> Choose Table from the View menu. This table lists the values 
defining each confidence interval: 


X1: Cholesterol 

t 95%:   95% Lower:   95% Upper:   t 90%:   90% Lower:   90% Upper: 
...      183.964      198.5        ...      185.151      197.312 

t 68%:   68% Lower:   68% Upper: 
...      187.573      194.89 


The 95% confidence band extends from a low of 183.964 to a high 
of 198.500; the 90% confidence band extends from a low of 
185.151 to a high of 197.312; the 68% confidence band extends 
from a low of 187.573 to a high of 194.890. The values under the t's 
in the table above are the t-values from a t table associated with the 95%, 
90% and 68% confidence levels, respectively, for 94 degrees of 
freedom. 
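
The 95% band can be reproduced approximately from the summary statistics. In the sketch below the function name is hypothetical, and the critical value of 1.9855 is an assumed, rounded t-value for 94 degrees of freedom taken from a standard t table rather than from the StatView II output.

```python
from math import sqrt


def t_confidence_interval(mean, sd, count, t_critical):
    """Mean plus or minus the critical t-value times the standard error."""
    half_width = t_critical * sd / sqrt(count)
    return mean - half_width, mean + half_width


# Assumption: approximate two-sided 95% critical t-value for 94 degrees of freedom.
T_95_DF_94 = 1.9855

# Mean, standard deviation and count as reported for Cholesterol.
lower, upper = t_confidence_interval(191.232, 35.674, 95, T_95_DF_94)
print(round(lower, 2), round(upper, 2))
```

Small differences in the last decimal place are expected, because the inputs are themselves rounded.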


*> Choose Scattergram from the View menu. 


68%, 90%, 95% Error Bars for Column X1: Cholesterol 


The two interior tick marks are associated with the 68% confidence 
interval and span from 187.573 to 194.89. The second set of exterior 
tick marks are associated with the 90% confidence interval and span 
from 185.151 to 197.312. The ends of the lines represent the 
demarcations for the 95% confidence interval and span from 183.964 
to 198.5. Thus, the probability of spanning the population mean 
within this interval is .95. Notice that as you become more 
confident that a confidence interval spans the population mean, the 
span associated with the confidence interval becomes wider. 


z-Score Distribution 


A z-score distribution is a quick method for displaying a standard 
score frequency distribution. It is possible to graph the distribution of 
standard scores associated with any variable within StatView II. A 
standard score is an observed value's deviation from the mean, 
expressed in standard deviation units. 
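
As a small illustration of this definition, the sketch below converts the reported minimum and maximum Cholesterol values to standard scores, using the mean and standard deviation from the table. The function name is hypothetical.

```python
def z_score(value, mean, sd):
    """Deviation of a value from the mean, in standard deviation units."""
    return (value - mean) / sd


MEAN, SD = 191.232, 35.674  # as reported for Cholesterol

z_min = z_score(115, MEAN, SD)  # the minimum lies a little more than 2 units below the mean
z_max = z_score(285, MEAN, SD)  # the maximum lies a little more than 2.6 units above it
print(round(z_min, 3), round(z_max, 3))
```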


Assuming that Cholesterol is still X1 and that you entered the 
information for the confidence intervals, it is possible to obtain a 
graphic representation of Cholesterol in standard score form (also 
referred to as z-scores). Note that the standard score information is 
also available if Mean, Std. Dev., etc. is selected. 


*> Choose Bar Chart from the View menu. You should see a chart 
with a single bar in the center: 


Bar Chart of Column Mean X1: Cholesterol 


This chart is, unfortunately, not too informative. It simply represents 
a bar whose height on the ordinate is comparable to the mean of 
variable X1. 


*> Click the composite/paging tool. 


The graph changes to a histogram representing the z-score frequency 
distribution of Cholesterol. The abscissa, or baseline, associated with 
the z-scores usually ranges from -3 to +3. The bars are drawn one 
z-score unit wide while the ordinate represents the frequency of 
z-scores associated with each bar. 


A z-score of 0 is always associated with the mean. For this 
illustration of our Cholesterol data, the tails of the distribution 
extend to -3 and +3. Frequently it is possible to see the skew of the 
distribution when looking at this graph. For our Cholesterol data 
there is only a slight positive skew, which is difficult to perceive in 
this graph. 


Box Plot 


Tukey originally developed the box plot. Note that our box plots are 
based upon the work of William Cleveland and differ from Tukey's 
box and whisker plots in the manner in which outliers are plotted. In 
the numerical description of Cholesterol, we computed five 
percentile ranks. We briefly discussed these five percentile ranks and 
the information that they provided. The box and whisker plot is a 
graphic method for displaying these five percentile points. 


*> Choose None from the Describe menu. This deselects the 
previously checked descriptive statistics. 


*> Choose Percentiles from the Describe menu. 


*> Choose Box Plot from the View menu. 


This box plot is derived from the five percentiles. The top of the box 
represents the 75th percentile, a Cholesterol value of 212. The 
bottom of the box represents the 25th percentile, a Cholesterol value 
of 168.5. The middle 50% of the Cholesterol values are contained 
within the span defined by the box boundaries. The line in the 
middle of the box represents the median for Cholesterol, 191. If the 
distribution is symmetric the median will be in the exact middle of 
the box. The lines extending above and below the box are referred to 
as “whiskers”. The top whisker is drawn from the Cholesterol value 
associated with the 75th percentile, 212, to the Cholesterol value 
associated with the 90th percentile, 233. The bottom whisker is 
drawn from the Cholesterol value associated with the 25th percentile, 
168.5, to the Cholesterol value associated with the 10th percentile, 
142. 


The small, somewhat overlapped circles under the lower whisker and 
seven small circles above the upper whisker represent observed 
values below and above the 10th and 90th percentile values, 
respectively. Extreme values are clearly apparent with box plots, as 
are outliers. StatView II always shows the highest 10% of observed 
values and the lowest 10% of observed values. When an observed 
value is an outlier, it will appear to be farther removed from the 
whisker than the other extreme values. For our Cholesterol data, the 
box plots make it clear that the high Cholesterol values are more 
extreme than the low Cholesterol values. 


Two controls appear on this view which are unique to box plots. The 
first control changes the box plot to a notched box plot. The notches 
represent 95% confidence bands about the median. 


*> Click the notch. A notched box plot appears: 


The notches are quite narrow on this plot. The confidence band for 
the median ranges from approximately 190 to 200. Our previous 
discussion regarding confidence bands may also be applied here with 
the exception that we are dealing with a median rather than a mean. 


The second control eliminates the representation of the extreme 
twenty percent of the observed values, the 10% below the 10th 
percentile and the 10% above the 90th percentile: 


The Cumulative Frequency Curve 

It is sometimes informative to plot observed values against their 
percentiles in a percentiles plot. The percentiles plot displays a 
cumulative frequency curve. 


To obtain a percentile plot you must have Percentiles selected in the 
Describe menu. 


*> Choose Percentiles from the Describe menu. 


*> Choose Scattergram from the View menu. 




The cumulative frequency curve allows you to very quickly estimate 
the percentile associated with any observed value in a distribution. 
The ordinate represents observed values, cholesterol, and the 
abscissa represents the values from 0 to 100. These abscissa values 
represent the cumulative percent below the upper limit of an 
observed value. The percentile plot will rise from lower left to upper 
right, regardless of the observed values used. 


*> Click the bottom control of the view control panel, which is 
unique to percentile plots. 


This will cause horizontal lines to be displayed on the plot, 
representing the five percentile values (10th, 25th, 50th, 75th, and 
90th percentiles) indicated in the table view and used with the box 
and whisker plot. 


Percentiles Plot for column: X1 Cholesterol 


If we wish to estimate the percentile associated with an observed 
Cholesterol count of 200, we simply scan across to the point or 
points associated with 200 and then scan down to the percentile 
associated with them, which is approximately the 58th percentile in 
this case. We would conclude that approximately 58 percent of the 
sample have Cholesterol levels below 200. 
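
Reading a percentile off the plot amounts to counting the share of observations below a given value. The sketch below shows that count on ten made-up values (not the Lipid Data), with a hypothetical function name; StatView II may interpolate or treat ties differently.

```python
def percent_below(values, threshold):
    """Percent of observed values that fall strictly below the threshold."""
    below = sum(1 for v in values if v < threshold)
    return 100.0 * below / len(values)


# Made-up values for illustration only.
sample = [142, 168, 175, 190, 191, 198, 205, 212, 233, 285]

print(percent_below(sample, 200))  # six of the ten values are below 200
```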


Notice that this percentile plot associated with our mesokurtic 
Cholesterol distribution has a minor middle “hump”. The plot is a 
slowly rising line except at the upper right. This slow rise suggests 
that the associated distribution is essentially mesokurtic. 
Furthermore, the abrupt increase in the percentile plot slope in the 
upper right suggests that a very small change in percentile is 
associated with a rather large change in observed value. Some of the 
most extreme high Cholesterol values are substantially higher than 
the rest of the Cholesterol values. If there had been an abrupt upward 
change at the lower end of the plot it would have been suggestive of 
some low outlier; that is, a small change in percentile value at the low 
end of the Cholesterol continuum is associated with a large change in 
observed Cholesterol value. 


There are certain implicit properties built into this view. We know 
that, by definition, half of the observed values will be to the right of 
the 50th percentile and that half of the observed values will be to the 
left of the 50th percentile. Furthermore, we know that the graphed 
points will go from the lower left to the upper right. We also know 
that if a distribution is normal, the relative frequency curve will be 
an S-shaped, sigmoid, curve. Deviation from the S-shape allows us 
to make certain inferences about the shape of the associated 
distribution. To aid in an interpretation of the percentile plots, we 
plot below cumulative frequency curves of four distributions: 
normal, platykurtic symmetric, positively skewed, and negatively 
skewed. 


Normal Distribution Curve 


For a normal distribution, a small change in an observed value near 
the center of the distribution will be associated with a relatively large 
percentile change. A change in an observed value at either the low or 
high end of the observed value distribution will result in only small 
percentile changes. Therefore, the relative frequency curve of 
observed values rises most rapidly at the low and high ends of the 
percentile range and has a plateau in the middle part. 


Symmetric Platykurtic Standard Scores 


For a symmetric, platykurtic distribution, a change of an observed 
value at almost any location will cause a relatively constant change 
in the percentile (except at the extreme ends of the distribution, 
where large observed value change will be associated with small 
percentile change). Therefore, the relative frequency curve should 
rise at a constant rate. 


Positively Skewed Standard Scores 


For a positively-skewed distribution a change of one observed value 
unit near the lower end of the distribution will be associated with 
relatively large percentile changes. As you move toward the right tail 
of the distribution, a change of one observed value unit will have less 
and less percentile change associated with it. Thus, a relative 
frequency curve associated with a positively skewed distribution will 
rise slowly from lower left to the upper right with the rise becoming 
more apparent as the curve approaches the right side of the graph. 


Negatively Skewed Standard Scores 

For a negatively-skewed distribution a change of one observed value 
unit near the lower end of the distribution will be associated with 
relatively small percentile changes. As you move toward the right 
tail of the distribution, a change of one observed value unit will have 
more and more percentile change associated with it. Thus, a relative 
frequency curve associated with a negatively skewed distribution 
will rise rapidly from lower left to the upper right with the rise 
becoming less apparent as the curve approaches the right side of the 
graph. 


The normal distribution is a mesokurtic curve, and you should 
therefore think of the percentile plot associated with mesokurtic 
distributions as being sigmoid. The other charts shown above are 
leptokurtic curves and are characterized by a cumulative frequency 
curve that shows a very steep slope as the plot rises from lower left 
to upper right. 
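
The link between the shape of a distribution and the slope of its percentile plot can be checked numerically. The following is a minimal Python sketch and is not part of StatView II. It assumes a standard normal distribution taken from Python's statistics module, and the function name percentile_curve is hypothetical. It computes the observed value at several percentiles and compares how far the curve rises over a 5-percentile step in the middle with the rise over a 5-percentile step at the upper end.

```python
from statistics import NormalDist


def percentile_curve(dist, percentiles):
    """Return the observed value at each percentile (the ordinate of a percentile plot)."""
    return [dist.inv_cdf(p / 100) for p in percentiles]


percentiles = [5, 10, 45, 50, 55, 90, 95]
values = percentile_curve(NormalDist(mu=0, sigma=1), percentiles)

# Rise of the curve over a 5-percentile step in the middle and at the upper end.
middle_rise = values[percentiles.index(55)] - values[percentiles.index(50)]
upper_rise = values[percentiles.index(95)] - values[percentiles.index(90)]

print(f"rise from 50th to 55th percentile: {middle_rise:.3f}")
print(f"rise from 90th to 95th percentile: {upper_rise:.3f}")
```

For a normal distribution the rise at the upper end is larger than the rise in the middle, which corresponds to the plateau in the middle part of the plot described above.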


At times you may have so few observed values in a particular range 
that it is difficult to ascertain the slope of the percentile plot. In such 
a situation it is helpful to have all points connected by a line. 


*> Choose Line Chart from the View menu. 


Connected Percentiles for column: X1: Cholesterol 

The resulting percentile plot is identical to the previous percentile 
plot except that it also has a line going through all of the plotted 
points. 

*> Choose Bar Chart from the View menu. 
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The resulting plot represents still another way of looking at the 
percentile plot. Rather than plotting the points, a bar is dropped from 
what would be the center of a point to the percentile baseline. As 
may be clear, these are two alternative graphic methods for viewing 
percentile data. The graphic method you choose is a function of your 
preference. 


Summary 


This discussion has shown how StatView II allows observation of 
the statistical variation of a variable in several different ways. We 
started by computing the mean, standard deviation and other 
summary statistics for the Cholesterol data, observing the results first 
in tabular and then in graphic form. 


Next, percentiles were used as a reference point for our analysis. 
Once again both tables and graphs were observed. Although we 
looked at the mean, standard deviation and other summary statistics 
first, and then the percentiles, you may choose all or any combination 
of the descriptive statistics and percentiles simultaneously. 


Graphic Comparisons 


StatView II computes descriptive statistics for as many X variables 
as you specify. More importantly, StatView II graphic features let 
you compare the distributions of your variables through composite 
graphs. The graphs include error bars and box plots. 


Making Descriptive Comparisons Around the Mean 


Virtually all of the graphic comparison options of StatView II 
assume that variables selected for comparison have a common unit 
of measure. Thus, if you attempted to compare Systolic BP with 
Cholesterol, you would get a graphic comparison with the unit of 
measurement being a combination of both variables. The lower end 
of the continuum would be determined by Systolic BP while the 
upper end of the continuum would be defined by Cholesterol. The 
comparative graphic representations would be impossible to 
interpret. 


In this example, we will not compare two of the existing variables in 
the dataset. Instead, we will take the Cholesterol variable and split it 
into two cholesterol measurements, one for females and one for 
males. StatView II makes it especially easy to split variables on the 
basis of associated categorical variables. 


*> Choose Split Columns from the Tools menu. A dialog box 
appears with the split key list on the left and the column to be 
split list on the right. 


Create New DataSet by Splitting Columns (dialog box listing the 
columns Smoking History, Alcohol use, Heart History, Gender, Age, 
Weight, Cholesterol, and % ideal body wt.) 


*> Select Gender as the split key and Cholesterol as the column to 
split. 


*> Select Done to create the dataset. 


When the dialog box disappears you will note a new dataset 
Untitled-1. This dataset contains two columns, the first representing 
the cholesterol measurements of the male subjects, the second the 
cholesterol measurements of the females: 


Untitled-1 


*> Assign male cholesterol as X1, and female cholesterol as X2. 


*> Choose Mean, Std. Dev., etc. from the Describe menu. 


*> Choose Confidence Intervals from the Describe menu and 
select t-distribution with 95% intervals. 


*> Choose Table from the View menu, and a table view appears 
showing the results for the X1 column, male cholesterol: 


*> Click the bottom arrow in the scroll bar to view the X2 column. 


You see the results for the next column, the X2 column, female 
cholesterol. 


X2: female - Cholesterol 
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While we could impose interpretations on these data similar to those 
imposed in the individual variable description section, it will turn out 
that the graphic views of X1 and X2 together will be much easier to 
interpret, from a comparative perspective, than the table views. Note 
that when StatView II split the data values, it created a dataset with 
enough rows to handle the 71 male values. Because there are only 
24 female values, the remaining rows (47) are filled with missing 
values. 
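
The effect of splitting a column on a category key, including the padding of the shorter column with missing values, can be sketched in a few lines of Python. This is only an illustration of the idea and not StatView II's own procedure; the function name split_column, the use of None for a missing value, and the sample numbers are all assumptions.

```python
def split_column(keys, values, missing=None):
    """Split `values` into one column per distinct key, padded to equal length."""
    groups = {}
    for key, value in zip(keys, values):
        groups.setdefault(key, []).append(value)

    # The new dataset needs as many rows as the longest column.
    n_rows = max((len(column) for column in groups.values()), default=0)

    # Shorter columns are filled out with missing values.
    return {
        key: column + [missing] * (n_rows - len(column))
        for key, column in groups.items()
    }


# Hypothetical data: four subjects, three male and one female.
gender = ["male", "female", "male", "male"]
cholesterol = [190, 200, 180, 210]
columns = split_column(gender, cholesterol)
print(columns)
```

In the Cholesterol example the same padding accounts for the 47 missing values: 71 male rows minus 24 female values.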


*> Choose Scattergram from the View menu. Make sure that the 
composite/paging control is set for a composite view, showing 
both columns. 


95% Error Bars for Columns: X1 .. X2 
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This view shows a single-tiered error bar for each variable. The ends 
of the bars represent the 95% confidence intervals, and are marked 
by the short horizontal lines. The means for these two groups, 
190.085 for the males and 194.625 for the females, appear to be 
different, but the error bars overlap, indicating that the two 
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cholesterol samples could be from the same population. It is 
reasonable to conclude that the difference of 4.54 that we see 
between the two means is just a consequence of chance. 


More precisely, if we repeatedly sampled cholesterol from the 
female population, using samples of size 24, and constructed 95% 
confidence intervals for each sample, we would expect that 95% of 
the confidence intervals would span the actual population mean 
cholesterol level, and that 5% of them would not span the actual 
population mean. It is apparent that the female cholesterol 
confidence band completely spans the male cholesterol confidence 
band. If we were to do a t-test (which we will do later), we might 
conclude that the female cholesterol mean is not significantly 
(p<.05) different from the male cholesterol mean. 
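
The comparison of two error bars can be reproduced by hand. The sketch below, in Python, computes a t-based confidence interval for a mean and checks whether two intervals overlap. It is a sketch under stated assumptions and not StatView II's own code: the function names are hypothetical, the sample values are invented, and the critical t value must be looked up by the reader for the sample's degrees of freedom (for a sample of 24, with 23 degrees of freedom, a standard t table gives about 2.069 for a 95% interval).

```python
from math import sqrt
from statistics import mean, stdev


def t_confidence_interval(sample, t_critical):
    """Return (lower, upper) limits of the interval: mean +/- t * s / sqrt(n)."""
    n = len(sample)
    center = mean(sample)
    half_width = t_critical * stdev(sample) / sqrt(n)
    return center - half_width, center + half_width


def intervals_overlap(a, b):
    """True if the two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Invented samples of five values each; 2.776 is the 95% critical t for 4 d.f.
group_a = [180, 190, 200, 210, 220]
group_b = [185, 195, 205, 215, 225]
interval_a = t_confidence_interval(group_a, 2.776)
interval_b = t_confidence_interval(group_b, 2.776)
print(interval_a, interval_b, intervals_overlap(interval_a, interval_b))
```

Overlapping intervals are only an informal indication; the t-test discussed later is the formal comparison.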


*> Choose Confidence Intervals from the Describe menu. 


*> Click Remove to remove the confidence intervals from the 
selected descriptive statistics. 


The view shows a single-tiered standard deviation bar for each 
variable. The center of each bar represents the mean of the associated 
variable. This view allows you to quickly look at the two variables to 
determine whether or not the two associated distributions might 
overlap with each other on the same continuum. It would appear 
from our scattergram that the male and female Cholesterol 
distributions overlap almost completely. 
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A much more informative view may be seen by clicking the 
composite/paging tool to view a separate univariate scattergram for 
each variable. We have discussed the interpretation of the univariate 
scattergram in the graphic section for individual variables. It is 
possible to page through the view windows for the two variables to 
compare the univariate scattergrams for outliers and the extent to 
which the two variables exhibit different skewness. Notice that for 
the females about 67 percent of the observed values, 14 observed 
values, are below the mean. This would imply that the female 
cholesterol distribution is positively skewed. This observation can be 
confirmed by computing the coefficient of skewness for these data, 
.906. 


Scattergram for column: X2: female - Cholesterol 
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The male distribution, however, appears to be almost symmetric, 
with 36 observed values below the mean and 35 observed values 
above the mean. This can be confirmed by computing the index of 
skewness for the male Cholesterol data. This index is .063, while a 
symmetric distribution would be 0. 
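
Both checks used above, counting the observed values below the mean and computing a coefficient of skewness, are easy to sketch in Python. The manual does not state which skewness formula StatView II uses, so the moment coefficient (third central moment divided by the 1.5 power of the second) is an assumption here, as are the function names and the sample values.

```python
from statistics import fmean


def skewness(sample):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (undefined for constant data)."""
    n = len(sample)
    center = fmean(sample)
    m2 = sum((x - center) ** 2 for x in sample) / n
    m3 = sum((x - center) ** 3 for x in sample) / n
    return m3 / m2 ** 1.5


def count_below_mean(sample):
    """Number of observed values strictly below the sample mean."""
    center = fmean(sample)
    return sum(1 for x in sample if x < center)


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]  # a few large values pull the mean upward

print(skewness(symmetric), count_below_mean(symmetric))
print(skewness(right_tailed), count_below_mean(right_tailed))
```

A value near 0 indicates symmetry; a positive value together with a majority of points below the mean points to a longer right tail.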
Scattergram for column: X1: male - Cholesterol 
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These two scattergrams are difficult to compare because they differ 
in sample size. The male sample being much larger than the female 
sample gives the erroneous visual impression that the males are 
much more heterogeneous than the females. If the samples were of 
approximately the same size such a visual impression might tend to 
be valid. 


It is difficult to understand the difference in variance between the 
males and the females when comparing their scattergrams. For some 
datasets there will be very clear subgroupings of the observed values 
either at the mean, above the mean, or below the mean. Such 
groupings suggest the existence of subgroups within the dataset. 
However, these scattergrams suggest that both the male and female 
subgroups tend not to have subgroups with regard to Cholesterol 
count. 
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*> Choose Bar Chart from the View menu. Once again make sure 
that the paging/composite tool is set for a composite view. 


This chart compares the variable means: 


Bar Chart of Column Means: X1 .. X2 
(ordinate: Units; abscissa: Columns, male - Cholesterol and 
female - Cholesterol) 


*> Click the composite/paging tool, and the view displays the z- 
score distribution for each variable grouping. 


As noted in the discussion of single group graphics, the bars drawn 
represent a standard deviation unit. The bar between 0 and 1 
represents a frequency count of the number of points between the 
mean and one standard deviation above the mean, that is, with 
z-scores from 0 to 1. Also as previously noted this view gives us a very 
rough picture of the variable distribution. The male and female 
z-distributions can be observed by paging through the two pages. 
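
The counts behind a z-score bar chart can be approximated as follows. This Python sketch converts each observed value to a z-score and counts how many fall in each interval one standard deviation wide. The function name is hypothetical, the use of the sample standard deviation (n - 1 divisor) is an assumption, and any grouping of extreme values into the end bars of the chart is not reproduced.

```python
from collections import Counter
from math import floor
from statistics import mean, stdev


def z_score_counts(sample):
    """Map each integer k to the number of z-scores in the interval [k, k + 1)."""
    center = mean(sample)
    spread = stdev(sample)  # assumption: sample standard deviation
    counts = Counter(floor((x - center) / spread) for x in sample)
    return dict(sorted(counts.items()))


# Invented data; the key 0 holds the points between the mean and +1 s.d.
bars = z_score_counts([1, 2, 3, 4, 5])
print(bars)
```

Comparing the counts for negative and positive keys gives the same rough picture of symmetry or skewness that the chart provides.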


Z Score of X1: male - Cholesterol (abscissa: Z Scale) 


These two views of the standard scores are somewhat redundant 
when considering the conclusions that we have already made. For 
the z-scale bar chart associated with the male Cholesterol count, 
there is no discernable skewness, as the distribution is almost 
symmetric. 


For the bar chart associated with the female Cholesterol distribution 
it is very easy to note the positive skewness. Notice that the right 
side of the distribution extends from 0 to 3 while the left side of the 
distribution extends only from 0 to -2. Clearly the right side of the 
distribution is longer than the left side. 


Z Score of X2: female - Cholesterol (ordinate: Count; abscissa: 
Z Scale) 


*> Choose Table from the View menu. 


*> Choose None from the Describe menu, clearing previously 
selected statistics. 


*> Choose Percentiles from the Describe menu. 


The table view shows the results for the X1 variable, male 
Cholesterol. By turning the pages, you see the results for the female 
Cholesterol: 


X2: female - Cholesterol 
10th %:   25th %:   50th %:   75th %:   90th %: 
ts96 174 ftg9s 20s 259.9 







*> Choose Box Plot from the View menu. Make sure that the 
paging/composite tool is set for composite. 


Note that in a box plot the total box plot is always contained in the 
view; therefore, visual comparisons of the box lengths will be a valid 
procedure. 


Box Plots for columns: X1 .. X2 
(ordinate: Units; abscissa: Columns, male - Cholesterol and 
female - Cholesterol) 


This composite view shows the box plots for each of the X variables 
selected: in our example, the male and female Cholesterol levels. 
With these two box plots a great deal of comparative information 
may be seen. You will recall that we discussed box plots in detail 
when we discussed the graphic description of a single variable. 


Because the box associated with the males is “thicker”, we can 
conclude that the middle 50 percent of the Cholesterol distribution 
from the male sample is more heterogeneous than the middle 50 
percent associated with the female sample. Notice the difference in 
whisker lengths for the females and the similarity in whisker lengths 
for the males. The fact that the whisker associated with the higher 
female Cholesterol counts is clearly longer than the whisker 
associated with the lower female Cholesterol counts suggests a 
positively skewed distribution. There will most likely be a longer tail 
at the upper end of the female Cholesterol distribution than at the 
lower end. 


Notice also that the extreme Cholesterol counts for the men are close 
to the whisker ends, whereas the extreme Cholesterol counts for the 
females are somewhat removed from the ends of the whiskers. These 
female outliers, when considered within the context of the 
homogeneity of the middle 50 percent of the female distribution, 
provide an explanation for the greater female variability. As a group, 
females tend to have a homogeneous Cholesterol count, but those 
females who have extreme Cholesterol counts, either high or low, 
tend to be very extreme. 


This explains why the standard deviation for female Cholesterol is 
larger than the standard deviation for male Cholesterol. It is possible 
that the standard deviations for these datasets are misleading. For a 
majority of the male and female samples, the females tend to be 
more homogeneous than the males, as is demonstrated by the 
females’ “slimmer” box in the box plot. 
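
The quantities read from a box plot in this discussion, the box length and the two whisker lengths, can be computed from the percentiles. The Python sketch below is illustrative only. It assumes that the whiskers reach the 10th and 90th percentiles (the outermost percentiles listed in the Percentiles table view) and that percentiles are interpolated with the "inclusive" method of Python's statistics.quantiles; StatView II may compute them differently. The function name and sample values are hypothetical.

```python
from statistics import quantiles


def box_summary(sample):
    """Return box length and whisker lengths from the 10/25/75/90th percentiles."""
    cuts = quantiles(sample, n=100, method="inclusive")  # cuts[k - 1] is the kth percentile
    p10, p25, p75, p90 = (cuts[k - 1] for k in (10, 25, 75, 90))
    return {
        "box_length": p75 - p25,      # spread of the middle 50 percent
        "lower_whisker": p25 - p10,
        "upper_whisker": p90 - p75,
    }


even = box_summary(list(range(101)))                      # evenly spread values 0..100
tailed = box_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])     # one extreme high value
print(even)
print(tailed)
```

An upper whisker that is clearly longer than the lower whisker, as in the second sample, is the pattern interpreted above as positive skewness.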


*> Click the composite/paging tool, and the view window displays 
the box plot for each variable on a separate page. 


*> Click the composite/paging tool again and you are returned to the 
composite view. 


*> Click the notch tool. The resulting notched box plots allow you 
to make a quick visual comparison of the male Cholesterol 
median confidence band with the female Cholesterol median 
confidence band: 


Notched Box Plots for columns: X1 .. X2 
(abscissa: Columns, male - Cholesterol and female - Cholesterol) 


The width of the notches suggests that the two median confidence 
bands are not only approximately the same width but also 
overlapping almost completely. The medians may be assumed to be 
the same. 


Three other views of percentiles are available: scattergram, bar, and 
line chart. These percentile plots may be a composite or may have 
only one variable per page. You may use the composite/paging tool 
to switch between the composite and single variable presentations. 


*> Choose Line Chart from the View menu. 


Connected Percentiles for columns: X1 .. X2 
(abscissa: Percentile; legend: male - Cholesterol, 
female - Cholesterol) 


This view is an effective descriptive comparison between male and 
female Cholesterol. Note that the ordinate represents observed 
values, Cholesterol level, and that the abscissa represents percentiles. 
This type of plot was discussed in detail in the univariate description 
section. 


Notice that the two percentile plots are reasonably comparable. The 
female plot rises more rapidly than the male plot at both the low and 
high ends of the Cholesterol continuum. This would suggest that the 
female distribution is more mesokurtic than the male distribution. 
Furthermore the gradual, almost unchanging slope associated with 
the male plot also suggests that the male plot may tend to be a bit 
platykurtic. The very abrupt increase in slope at the end of the 
female plot may be attributed to the female outliers. 


Ideally, if two distributions are similar, their percentile plots will 
be identical in appearance, but not necessarily superimposed. 


Summary 


These examples have shown how the graphic features of StatView II 
can compare either several variables or several samples on a single 
variable. The first example used the mean as a starting point for 
observing Cholesterol variation among males and females. Next, 
percentiles were used as a reference point for our comparisons. In 
each case, the variation between the variables was made clear 
through different graphic views of error bars and box plots. 


Once again, it is important to understand that while we looked at the 
mean statistics first and then the percentiles, you may choose to 
compute one or all of the descriptive statistics. 


*> Choose Table from the View menu. 


*> Choose all of the descriptive statistics except Frequency 
Distribution. 


The table includes the results for all selected statistics. Each page 
of the view window contains the results for a different X 
variable. All pages of the results may be printed. 


Frequency Distribution 


StatView II computes frequency distributions for X variables. 
Frequency distributions treat category and continuous (integer, long 
integer, and real) data differently. The first example is of the 
Cholesterol distribution, which is defined by data that are continuous. 


Frequency Distribution of Continuous Data Variables 


*> Choose Frequency Distribution from the Describe menu, and 
the following dialog box appears: 
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Frequency Distribution Parameters 


Number of Intervals: [   ] 

Do you wish to enter your own interval width?: 
(•) no   ( ) yes   width: [   ]   initial value: [   ] 

For Each Interval: 
(•) Include lowest value   ( ) Include highest value 

Intervals Indicate: 
(•) count   ( ) cumulative 

( OK )   ( Cancel ) 





The text entry box labelled Number of Intervals allows you to 
specify the number of intervals in the distribution to be plotted. The 
maximum number of intervals that may be entered is 1000. 


The next parameter determines the interval width. If you do not wish 
to enter your own width, StatView II creates a width equal to the 
data range plus one divided by the number of intervals. If you do 
wish to enter your own interval width, enter the width and the initial 
value in the respective text entry rectangles. 


The next parameter in the dialog box determines whether a given 
interval includes the lowest value in the interval or the highest value. 
If you decide to include the lowest value, each interval contains the 
count of data values that are greater than or equal to the lower limit 
of the interval and less than the upper limit of the interval. If you 
decide to include the highest value, each interval contains the count 
of data values that are greater than the lower value and less than or 
equal to the upper value. 


The final parameter determines whether the intervals indicate the 
count of values or the cumulative count of values. You specify 
whether you want to have the height of a bar determined by the 
frequency of observed scores in the interval or by the cumulative 
frequency of scores up to the upper limit of the interval. 
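
The interval parameters described above can be sketched as a small Python function. This is an illustration of the rules as stated in the text, not StatView II's implementation: the function name and argument names are hypothetical, the data values are invented, and starting the first interval at the lowest observed value when no initial value is entered is an assumption.

```python
def frequency_distribution(data, n_intervals, width=None, initial=None,
                           include_lowest=True):
    """Return (interval width, list of counts per interval)."""
    low, high = min(data), max(data)
    if width is None:
        # Default width from the text: (data range + 1) / number of intervals.
        width = (high - low + 1) / n_intervals
    if initial is None:
        # Assumption: the first interval starts at the lowest observed value.
        initial = low

    counts = [0] * n_intervals
    for x in data:
        for i in range(n_intervals):
            lower = initial + i * width
            upper = lower + width
            if include_lowest:
                inside = lower <= x < upper   # lower limit included
            else:
                inside = lower < x <= upper   # upper limit included
            if inside:
                counts[i] += 1
                break
    return width, counts


# Invented values spanning 115 to 285, so the default width is (170 + 1) / 10.
width, counts = frequency_distribution([115, 120, 140, 200, 285], n_intervals=10)
print(width, counts)
```

The choice between including the lowest or the highest value matters only for observations that fall exactly on an interval limit, as the two calls in the test data below the block would show for the values 1, 2, 2, 3 with a width of 1.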


*> Click OK. 


*> Choose Table from the View menu. 


X1: Cholesterol 


This view is a tabular summary of our total Cholesterol data as it will 
be used to construct a histogram. The range used is 171: the highest 
observed value plus one (286) minus the lowest observed value (115). 
This is one greater than the range of the data, so that both endpoints 
are included. The default options define the interval length as 
171/10, or 17.1 units. The intervals are then constructed from the 
lowest score to the highest score. The table summarizes the histogram 
bars in terms of the frequency within the bar and the percentage 
associated with the bar. Either summary may be used to determine the 
heights of the bars in the histogram. If there is a modal interval, it 
is labeled. 


*> Choose Bar Chart from the View menu. 
Histogram of X1: Cholesterol (ordinate: Count; abscissa: Cholesterol) 


You now have a graphic representation of the table. If you have 
selected more than one X variable, scrolling the pages will show you 
the histograms for successive X variables. 


It is also possible to represent the data cumulatively. Cumulative 
data are defined by intervals whose counts represent the frequency of 
observed values below the upper limit of the associated bar. Such a 
distribution is very similar to a percentile plot, but the axes differ. 
The bars increase in height from lower left to upper right. The 
ordinate represents frequencies. 
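
A cumulative distribution is simply a running total of the interval counts. The short Python sketch below shows the idea, and also converts the running totals to percentages, which is what makes the cumulative histogram resemble a percentile plot. The function names and the counts are hypothetical.

```python
from itertools import accumulate


def cumulative_counts(counts):
    """Running total of the interval counts, from the lowest interval upward."""
    return list(accumulate(counts))


def cumulative_percents(counts):
    """Running total expressed as a percentage of all observed values."""
    total = sum(counts)
    return [100 * running / total for running in accumulate(counts)]


interval_counts = [2, 1, 0, 0, 1]  # invented counts for five intervals
print(cumulative_counts(interval_counts))
print(cumulative_percents(interval_counts))
```

Because no count can be negative, the bar heights never decrease from left to right, and the last bar always equals the total number of observed values.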


*> Choose Frequency Distribution from the Describe menu. 
*> Click Cumulative. 


*> Click OK. 


Histogram of X1: Cholesterol 


The interpretation of this cumulative histogram is quite similar to the 
interpretation of the percentile plot. 


Now suppose that you want the most accurate graphic portrayal 
possible for a given set of data. Ideally you would want a histogram 
with a bar corresponding to every observed value in the dataset. 


*> Choose None from the Describe menu to clear all statistics. 
*> Choose Mean, Std. Dev., etc. from the Describe menu. 
*> Choose Table from the View menu. 


The table view provides you with the basic descriptive data for 
Cholesterol. The range, 170 for Cholesterol, provides some sense of 
how many bars will be required to represent all possible observed 
values. For our Cholesterol data, each observation is measured to the 
nearest unit, and therefore we will need 170 bars. If our data were 
measured to the nearest .5, we would require 340 bars. Each bar 
should span from the real lower limit to the real upper limit of the 
observed value associated with it. The table view also shows us that 
the lowest value in the Cholesterol distribution is 115. The real lower 
limit of 115 is 114.5 and the real upper limit of 115 is 115.5. If we 
select an interval width of 1, for 1 Cholesterol unit, and use 114.5 as 
our initial value, then we will need 170 intervals to cover the range 
of values from 115 to 285, the maximum observed Cholesterol 
value. 


*> Choose Frequency Distribution from the Describe menu. 
*> Enter 170 for the Number of Intervals. 

*> Enter 1 for the interval width. 

*> Enter 114.5 as the initial value. 

*> Click Include lowest value. 

*> Click Count. 

*> Click OK. The resulting histogram has a bar to represent each 
unique observed value and represents every peak and trough in 
the data: 
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Histogram of X; : Cholesterol 
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Because there are gaps in the data (that is, possible observed values 
that do not occur in the dataset), some of the histogram bars appear 
to be a bit isolated and have wide troughs. 


The histogram based on cumulative frequencies looks much better 
than the histogram based on simple counts. 


*> Choose Frequency Distribution from the Describe menu. 
*> Click Cumulative. 


*> Click OK. 


Histogram of X1: Cholesterol (abscissa: Cholesterol) 


The data gaps are now represented by the rough surface of the 
distribution. 


Frequency Distribution of Category Data Variables 


Suppose we wished to look at heart history: whether there has been a 
history of heart attack and the general age at which it occurred. This 
is a categorical variable and has four categories: none for no history 
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of attack, <50 indicating at least one heart attack in the family prior 
to the age of 50, <60 at least one heart attack in the family prior to 
age 60 but after age 50, >60 at least one heart attack in the family 
after age 60. 


*> Clear the X variable by double-clicking Cholesterol. 
*> Assign X to the column Heart History. 
*> Choose Frequency Distribution from the Describe menu. 


The value for the number of intervals defaults to the highest number 
of elements contained in the category column selected. If the number 
of intervals selected is greater than the number of elements, 
StatView II only displays the information on the category elements. 
If the number of intervals chosen is less than the number of elements 
in the category set, StatView II displays information on only the 
selected number of intervals, and indicates with a message that the 
category set contains more elements than are being displayed. 
Interval width and lowest/highest value do not apply for category 
variables and their settings are ignored. You may choose to display 
count or cumulative information. 


*> Choose Count. 
*> Choose OK. 


*> Choose Pie Chart from the view window. 


Pie Chart of X1: Heart History 
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Notice that the order of the categories in the pie chart corresponds to 
their order in the legend. The order of the categories in the legend is 
a function of their order at data entry. The category of “none” 
corresponds to “element 1” at the time that the categories were 
defined. In a somewhat analogous fashion the “over 60” category 
corresponded to “element 4” at the time that the categories were 
defined. 


You can very quickly estimate that approximately 60-75% of the 
sample has no history of heart attack. It is also possible to conclude 
that the proportion of respondents with a history of heart attacks 
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prior to age 50 is approximately the same as the proportion of 
respondents with a history of heart attacks before 60 but after 50. 
You may also conclude from the pie chart that the proportion of 
respondents with a history of heart attacks after 60 is about equal to 
the proportion of respondents with a history of heart attacks before 
60. 


*> Choose Bar Chart from the View menu. 


Histogram of X1: Heart History 
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You now have a histogram view of the heart history data. It would 
be more difficult to make comparative statements with the histogram 
than with the pie chart. It is a matter of personal preference whether 
you prefer bar charts or pie charts. StatView II allows you to use 
either or both. 


*> Choose Table from the View menu. 


The table displays the category element name, the frequency with 
which that element appears in the current X column, and the 
percentage of the data it represents. 
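
The contents of this table are straightforward to reproduce: one row per category element, in the order in which the elements were defined, with a count and a percentage. The Python sketch below illustrates this. The function name is hypothetical and the observations are invented; only the four element names come from the Heart History column described above.

```python
from collections import Counter


def category_table(values, element_order):
    """Return rows of (bar number, element, count, percent) in element order."""
    counts = Counter(values)
    total = len(values)
    return [
        (bar, element, counts[element], 100 * counts[element] / total)
        for bar, element in enumerate(element_order, start=1)
    ]


elements = ["none", "<50", "<60", ">60"]      # order of definition
observed = ["none", "none", "<50", ">60"]     # invented observations
for row in category_table(observed, elements):
    print(row)
```

Keeping the rows in element order, rather than sorting by count, mirrors the way the legend order follows the order at data entry.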


X1: Heart History 
Bar: Element: Count: Percent: 


a Ce a 16.842 





Although pie charts are available for all frequency distribution data, 
they make the most sense for category data. Such data have names 
associated with regions rather than bar numbers. It is very difficult to 
recall what a bar represents without having a copy of either the 
histogram table or the histogram chart. 


The logic of a pie chart is to represent the bars in such a fashion that 
the proportion of the whole associated with any one category is 
readily seen. It also facilitates visual comparisons among the bars. 
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The visual complexity of a pie chart increases with the number of 
bars associated with it. 


Summary 


StatView II allows you to create frequency distributions of your data 
with up to 1000 bars. The interval widths, initial values, whether to 
include observed values equal to the lower limit or the upper limit of 
the interval in the count, and whether the histogram is a frequency 
count or cumulative count can all be specified. 


The frequency information can be viewed in both tabular and 
graphic forms. The graphic forms include bar charts, also referred to 
as histograms, and pie charts. 


Finally, StatView II recognizes the difference between category and 
continuous data in determining and displaying frequency 
distributions. 


Chapter 5 — The Compare Menu 


The following discussion refers to information from the Learning 
StatView II chapter. You should be familiar with the concepts 
discussed in the following sections of that chapter: Assigning 
Variables, Graphing Data, Selecting a Statistic, and Viewing 
Results. 


This discussion of the statistics in the Compare menu uses several 
datasets found in the Sample Datasets folder. For each statistic 
example, we state which datasets are used; make sure the correct 
dataset is open. 


The statistics are presented here in the order they appear on the 
menus. 


Compare Percentiles 


The Compare Percentiles command compares 19 corresponding 
percentiles of two variables. The percentiles compared are 1, 2, 3, 4, 
5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 96, 97, 98, 99. 


In particular, this procedure allows the user to compare two data 
distributions. The nth percentile of a dataset is the value such that 
n% of the data are less than or equal to the value. The percentiles of 
one distribution are plotted against the corresponding percentiles of a 
second distribution. Usually such comparisons are made between 
distributions that represent measures on the same variable. The graph 
is always drawn as a box with the abscissa defined by the X variable 
and the ordinate defined by the Y variable. 


If the two distributions are comparable, the 19 plotted points will fall 
on a straight line passing from the lower left corner to the upper right 
corner of the graph. If a point is above the line then the sample 
associated with the ordinate has a higher score associated with the 
given percentile point than the sample associated with the abscissa. 
Alternatively, if a point is below the line, the sample associated with 
the abscissa has a higher score associated with the point than the 
sample associated with the ordinate. In general, this percentile 
comparison graph is most informative with regard to comparing the 
distribution forms of a variable in two samples. The samples may come 
from either the same group or two different groups. 
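
The comparison can be sketched in Python as follows. The sketch computes the 19 percentiles for an X sample and a Y sample and reports, for each percentile, whether the plotted point would lie above, on, or below the diagonal on which X and Y are equal. It is not StatView II's algorithm: the function name is hypothetical, the samples are invented, and the "inclusive" interpolation of Python's statistics.quantiles is an assumption about how percentiles are computed.

```python
from statistics import quantiles

COMPARED = [1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 96, 97, 98, 99]


def compare_percentiles(x, y):
    """Return rows of (percentile, x value, y value, position relative to the line)."""
    x_cuts = quantiles(x, n=100, method="inclusive")  # x_cuts[k - 1] is the kth percentile
    y_cuts = quantiles(y, n=100, method="inclusive")
    rows = []
    for p in COMPARED:
        x_value, y_value = x_cuts[p - 1], y_cuts[p - 1]
        if y_value > x_value:
            position = "above"   # the Y sample has the higher score
        elif y_value < x_value:
            position = "below"   # the X sample has the higher score
        else:
            position = "on"
        rows.append((p, x_value, y_value, position))
    return rows


x_sample = list(range(101))             # invented X values 0..100
y_sample = [v + 5 for v in x_sample]    # the same shape, shifted upward by 5
for row in compare_percentiles(x_sample, y_sample):
    print(row)
```

In this invented case every point lies above the line by the same amount, which indicates two distributions of the same form that differ only in location.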


This statistic has no dialog box. 
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Assigning Variables 


The Compare Percentiles command compares the corresponding 
percentiles between X and Y variables. It is a OneXOneY statistic. If 
a single X variable and multiple Y variables are assigned, a result is 
calculated for each Y variable. If there is a single Y variable and 
multiple X variables assigned, a result is calculated for each X 
variable. If there are multiple X and Y variables assigned, a result is 
calculated for each Xi-Yi pair (matching subscripts). 


For example, compare the Cholesterol Distribution for males to the 
Cholesterol Distribution for females. 


*> Choose Split Columns from the Tools menu. 
Create a new dataset using Gender as the split key and 
Cholesterol as the column to split. (See Chapter 6 for more 
information on splitting columns.) 


*> Assign an X to the Male-Cholesterol column. 


*> Assign a Y to the Female-Cholesterol column. 


Table View 


*> Choose Table from the View menu. 


Percentile Comparison for X1: male - Cholesterol   Y1: female - Cholesterol 


%     male - Cholesterol   female - Cholesterol 
50    215                  213.8 
90    232.4                259.9 
95    250.55               279.4 
96    255.62               281.32 
98    267.8                285 
99    274.9                285 

(Only the rows that remain legible are shown; the view lists all 19 percentiles.) 




This table view provides all the information that will be used to plot 
the percentile comparison graph. The names of the X and Y variables 
analyzed are shown in the view title. The percentile compared is 
indicated in the % column. The 19 variable percentile values are 
displayed in their respective columns. 


Notice that below the median (the 50th percentile), the female 
Cholesterol count is higher than the male Cholesterol count at every 
percentile. However, between the 50th and 80th percentiles the male 
Cholesterol count is higher than the female Cholesterol count. At the 
upper extreme, from the 90th to the 99th percentile, the females once 
again exceed the males. 


Graphic Views 


*> Choose Scattergram from the View menu. 


Percentile Comparison Graph for columns: X1Y1 


The graph drawn is a square graph; both axes are equal in length and 
the line x=y is drawn. The points plotted are the 19 comparison 
percentiles listed above. At both extremes of the line, the points are 
above it, suggesting that the females have higher Cholesterol counts 
at the extremes than do the males. You can see the data converging 
on the line near the median. This convergence suggests that the 
middles of the two distributions are similar. 


The Scattergram view contains a tool specific to this statistic, the 
unequal axis tool. It appears as the sixth control from the top. 


Clicking on this control rescales the graph to use the maximum 
display area. The graph will no longer be square and the x=y reference 
line is removed. Both of these graphs, the square and non-square 
forms, may be viewed as Line Charts. 


t-Test 


Frequently, you will want to compare two means to see if they are 
comparable. Furthermore, when making such a comparison, you may 
also be questioning whether or not it is reasonable to assume that the 
samples from which the means were computed could have come 
from the same population. StatView II provides three different types 
of t-Tests: one group t-Test, paired two group tests, and unpaired two 
group tests. 


*> Choose t-Test from the Compare menu, and the t-Test dialog 
box is displayed. 


Select t-Test: 
   One Group t-Test 
   Two Group Tests: Paired, Unpaired 
t Values: one tail, two tail 





You can specify which t-Test to calculate. You can also specify 
whether the probability values correspond to one- or two-tail tests of 
significance. 


One Group t-Test 


The one group t-Test compares a sample mean to a hypothesized 
population mean. It is a OneX statistic. A result is calculated for each 
X variable assigned. 


The hypothesis tested in this example is that the observed sample 
mean is consistent with the mean specified in the dialog box. Our 
dataset, Lipid Data, is based on 95 subjects. These subjects were 
educated with regard to the relationship between Cholesterol count 
and diet. 43 of the subjects had Cholesterol samples taken three years 
after their initial measures. We have ordered these subjects in our 
dataset so that they are the first 43 subjects. The last column of the 
dataset is Cholesterol Loss. It represents the difference between the 
initial Cholesterol measure and the subsequent Cholesterol measure 
some three years later (see Formula in Chapter 6 to learn how to 
create a new column by formula from variables in your dataset). 


The average Cholesterol loss is 9.767 units. Unfortunately, not all 
subjects showed a reduction in Cholesterol count. Indeed, some 
subjects elevated their Cholesterol by as much as 73 units while 
others reduced their Cholesterol by as much as 62 units. A researcher 
might question whether the average Cholesterol Loss of 9.767 units 
is just chance variation. That is, can we assume that 9.767 is 
significantly greater than 0? 


This is a one-tail test. Had we asked the more fundamental, and 
somewhat uninformed, question of whether or not 9.767 is different 
from 0, we would be implying a two-tail test. We do not know either 
the population mean or the population standard deviation with regard 
to Cholesterol Loss. The one group t-Test will use our sample 
standard deviation for Cholesterol Loss, 27.627, as an estimate of the 
population standard deviation. On the basis of a sample of size 43 
with a standard deviation of 27.627, is it possible that the sample 
mean of 9.767 units does not represent a real cholesterol loss and is 
just a chance fluctuation from 0? 


Using our Cholesterol dataset, let us assign X to the Cholesterol Loss 
column and select the correct t-Test. 


*> Assign an X variable, X1, to the Cholesterol Loss column. 


*> Choose t-Test from the Compare menu to display the t-Test 
dialog box. 


*> Click One Group t-Test. 
*> Click One tail. Click OK. 
*> Select Table View from the View menu. This dialog box appears, 
prompting you to enter the population mean. 


Enter the population mean: 


*> Enter 0 and click OK. You see: 


One Sample t-Test X1: Cholesterol Loss 


DF:   Sample Mean:   Pop. Mean:   t Value:   Prob. (1-tail): 


Note: 52 cases deleted with missing values. 


This one group test essentially addresses the question of whether or 
not a sample of size 43 with a mean of 9.767 could have been 
selected from a population that actually had a mean of 0. The table 
tells you that it is quite improbable that this sample was drawn from 
a population with a mean of 0. Specifically, the probability of 
obtaining a mean of 9.767 or larger by chance alone, if the 
population mean is 0, is .0127. If you are operating at the 
.05 significance level, you will conclude that the mean of 9.767 
represents a real positive difference, because the reported probability 
level, .0127, is less than .05. The dieting is 
probably effective in reducing cholesterol counts for the sample. 
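

The arithmetic behind this result can be sketched as follows, using only the summary figures quoted in the text (mean 9.767, standard deviation 27.627, 43 cases). The function names are hypothetical, and the tail probability is obtained by numerically integrating the t density with Simpson's rule, which is an assumption made to keep the sketch self-contained rather than a description of how StatView II computes it.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """P(T >= t), integrating the density over [0, |t|] with Simpson's rule."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    upper = 0.5 - total * h / 3
    return upper if t >= 0 else 1 - upper


def one_group_t(sample_mean, sample_sd, n, pop_mean=0.0):
    """Return (t, df, one-tail probability, two-tail probability)."""
    std_error = sample_sd / math.sqrt(n)
    t = (sample_mean - pop_mean) / std_error
    df = n - 1
    p_one = t_upper_tail(t, df)
    p_two = 2 * min(p_one, 1 - p_one)
    return t, df, p_one, p_two


# Summary figures quoted in the text for Cholesterol Loss.
t, df, p_one, p_two = one_group_t(9.767, 27.627, 43, pop_mean=0.0)
```

With these inputs the one-tail probability comes out close to the .0127 reported in the table, and the two-tail probability is twice that value.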


No graphic views are available. 


Paired Two Group t-Test 


The paired two group t-Test computes a paired t value between an X 
and Y column, where each row entry for both columns is assumed to 
be a measure on the same subject. Sometimes this test is also 
referred to as a dependent sample or repeated measures t-test. It is a 
OneXOneY statistic. If a single X variable and multiple Y variables 
are assigned, the X is compared to each Y variable. If there is a 
single Y variable and multiple X variables assigned, a result is 
calculated for each X variable. If there are multiple X and Y 
variables assigned, a result is calculated for each Xi-Yi pair 
(matching subscripts). 


The paired two group t-Test compares an X mean and a Y mean 
determined from either the same sample of respondents or two 
samples that are known to be dependent. The null hypothesis 
assumes that the two means have been defined by samples from the 
same population, and therefore we would expect the difference 
between the means to be 0. The analysis addresses the question of 
whether or not the observed difference between the two means is a 
chance difference. This difference might either be larger than we 
might expect by chance, which is a one-tail test, or be more extreme 
than we might expect by chance, which is a two-tail test. 


We have many variable measures on our 95 subjects, but it would be 
unreasonable to make comparisons between variables that represent 
very different measures. For instance, if we were to compare the 
Systolic BP with Cholesterol variables, we would most certainly 
find a statistically significant difference, but it wouldn't have any 
substantive meaning. In our discussion of the one group test, we 
noted that 43 subjects receiving dietary education reduced their 
Cholesterol count over a three-year period. You might question how 
their diet modification affected their weight. Did the diet 
modification also bring about a weight loss? Is their weight at the 
end of three years significantly less than it was when they began 
their diet modification? This is a one-tail hypothesis that specifically 
questions whether the mean weight associated with the third year 
measure is statistically less than the mean weight associated with the 
initial measure. We are using the same group of subjects for the 
computation of both means. 


*> Assign an X variable, X1, to the Weight column. 
*> Assign a Y variable, Y1, to the Weight-3yr column. 


*> Choose t-Test from the Compare menu to display the t-Test 
dialog box. 


*> Click Paired. Click One tail. Click OK. 


*> Choose Table from the View menu, and the following view 
appears: 


Paired t-Test X1: Weight Y1: Weight-3yr 


Mean X-Y:   Paired t value:   Prob. (2-tail): 


Note: 52 cases deleted with missing values. 


The mean weight associated with the initial measures was 
approximately 164.5, whereas the mean weight associated with the 
measures taken some three years later was 166.5. You do not even 
need a statistical test to tell you that the sample did not lose weight. 
You can compare the two means visually and immediately see that 
the second mean is certainly not less than the first mean. The 
associated probability is .0634. If you are operating at the 
.05 significance level, you will conclude that the mean of 166.5 is 
not less than the mean of 164.5 and that any difference between the 
third year mean weight and the initial mean weight is due solely to 
chance fluctuation. Evidently, the dieting, while effective in reducing 
cholesterol counts for the sample, does not tend to reduce weight for 
the sample. 
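

A paired t value of this kind can be sketched as below. The sketch assumes the textbook definition (the mean of the row-by-row differences divided by its standard error) and drops any row in which either value is missing, mirroring the note about deleted cases. The function name is hypothetical and `None` stands in for a missing value.

```python
import math
import statistics


def paired_t(x, y):
    """Return (mean of X-Y, paired t value, degrees of freedom).

    Rows where either value is None are dropped before the computation.
    """
    if len(x) != len(y):
        raise ValueError("X and Y must have the same number of rows")
    diffs = [a - b for a, b in zip(x, y) if a is not None and b is not None]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_diff = statistics.mean(diffs)
    std_error = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff, mean_diff / std_error, n - 1
```

Because the test works on the differences, it is equivalent to a one group t-Test of the difference column against a population mean of 0.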


No graphic views are available. 


Unpaired Two Group t-Test 


The unpaired two group t-Test computes an unpaired t value 
comparing the means between two groups in a single Y column. The 
groups in the Y column are specified by an X column which must be 
either a Category or Integer column. The X column represents the 
independent variable and must have no more than two levels or 
groupings associated with it or else you will receive an error 
message. If you have more than two levels or groupings associated 
with X you should consider a one-way analysis of variance. 
Sometimes this test is also referred to as an independent sample or 
non-repeated measures t-test. It is a OneXOneY statistic. 


If there is a single X variable and multiple Y variables assigned, 
mean values for each Y variable are compared for the groups defined 
by X. If there is a single Y variable and multiple X variables 
assigned, the means for variable Y are compared for the set of 
groups defined by each X variable. If there are multiple X and Y 
variables assigned, a result is calculated for each Xi-Yi pair 
(matching subscripts). 
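

The way a grouping column splits a single Y column can be sketched as follows. The sketch assumes the classical pooled-variance form of the unpaired t value; the manual does not state which form StatView II uses, so treat that as an assumption. The function names are hypothetical.

```python
import math
import statistics


def split_by_group(x_labels, y_values):
    """Split the Y column into groups defined by the X column."""
    groups = {}
    for label, value in zip(x_labels, y_values):
        if label is None or value is None:
            continue  # rows with missing values are dropped
        groups.setdefault(label, []).append(value)
    if len(groups) != 2:
        raise ValueError("X must define exactly two groups; "
                         "with more, consider a one-way analysis of variance")
    return groups


def unpaired_t(x_labels, y_values):
    """Return (t value, degrees of freedom) using a pooled variance (assumed)."""
    (_, a), (_, b) = split_by_group(x_labels, y_values).items()
    df = len(a) + len(b) - 2
    pooled = ((len(a) - 1) * statistics.variance(a)
              + (len(b) - 1) * statistics.variance(b)) / df
    std_error = math.sqrt(pooled * (1 / len(a) + 1 / len(b)))
    return (statistics.mean(a) - statistics.mean(b)) / std_error, df
```

The sign of the t value depends on which group appears first in the X column.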


The unpaired two group t-Test compares two sample means 
determined from two independent samples. Statistically, we assume 
that the two means have been defined by samples from the same 
population, and therefore we would expect the difference between 
the means to be 0. The statistical analysis addresses the question of 
whether or not the observed difference between the two means is a 
chance difference, either larger than we might expect by chance, a 
one-tail test, or more extreme than we might expect by chance, a 
two-tail test. 


Unlike the paired t-test example, we don't have to worry quite as 
much about illogical analyses. When we have independent samples, 
StatView II essentially splits a single Y column on the basis of the 
groupings associated with the X column. Consider again the 
Cholesterol counts for males and females. All of our preliminary 
graphic work suggested that there simply is no difference between 
mean Cholesterol levels of males and females. If you were going to 
generate a hypothesis on the basis of our preliminary analyses, you 
would hypothesize that the male Cholesterol mean, 190.085, is not 
significantly different from the female Cholesterol mean, 194.625. 
This is a two-tail hypothesis. 


*> Assign an X variable, X1, to the Gender column. 
*> Assign a Y variable, Y1, to the Cholesterol column. 


*> Choose t-Test from the Compare menu to display the t-Test 
dialog box. 


*> Click Unpaired. Click OK. 


*> Choose Table from the View menu and the following table 
appears: 


Unpaired t-Test X1: Gender Y1: Cholesterol 


DF:   Unpaired t Value:   Prob. (2-tail): 


Group:   Count:   Mean:     Std. Dev.:   Std. Error: 
male              190.085   35.299       4.189 


Assuming a significance level of .05, the table, not surprisingly, 
suggests that the difference between the male and female Cholesterol 
levels, 4.54, is most likely due to chance and may be assumed to be 0. 
The Cholesterol counts for the two samples are not significantly 
different from each other. 


No graphic views are available. 


Correlation Coefficient 


A correlation coefficient indicates the degree of linear relationship 
between two variables. Generally, it is assumed that you have a 
single sample with two sets of observed values, an X and a Y 
observed value on each subject or sampling unit. 


A positive correlation suggests that, as the observed values on one 
variable increase or decrease, the observed values on the other 
variable increase or decrease proportionately. A negative correlation 
coefficient suggests that, as the observed values on one variable 
increase, the observed values on the other variable decrease 
proportionately. A correlation of 0 suggests that there is not a linear 
relationship between the two variables. 


The previous analyses of Cholesterol counts taken before and three 
years after exposure to dietary effects on cholesterol suggested that 
the Cholesterol counts were reduced over the three year period. You 
might wonder whether there is a linear relationship between 
Cholesterol counts taken during the two time periods. A positive 
correlation would suggest that there was most likely a general 
decrement in Cholesterol regardless of initial cholesterol level. A 
negative correlation would suggest that those with large initial 
Cholesterol counts tended to show extremely large cholesterol 
decrements while those with low initial Cholesterol counts showed 
an actual increase in cholesterol count, to the extent that the subjects' 
rank order on Cholesterol was actually reversed. A 0 correlation 
would suggest that the changes that occurred randomly reordered the 
subjects. Of the three possible outcomes, that which is associated with 
a positive correlation seems to be the most reasonable expectation. 


There are two ways that StatView II can be used in calculating a 
correlation. Either a correlation coefficient can be computed between 
an X and Y variable, or a correlation matrix can be computed relating 
a set of X variables. The correlation coefficient option provides both 
a brief summary table and a scattergram of the X-Y pair. The matrix 
option, typically associated with a number of X variables, does not 
provide a scattergram and does not provide a table summary of the 
relationship. Rather, the matrix approach defines a StatView II 
dataset that is a correlation matrix. Such a matrix may be saved and 
may also be modified for future use with factor analysis. 


*> Choose Correlation from the Compare menu and the 
Correlation dialog box is displayed: 


Select Correlation: 
   Coefficient using X-Y pair(s) 
   Matrix using X Columns 
      [ ] save correlation matrix 


The options let you specify computing either a correlation coefficient 
between an X and Y variable or a correlation matrix for a set of X 
variables. 





If you choose to compute a correlation matrix, you have the option 
of saving the matrix as a StatView II dataset. 
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Correlation Coefficient 


Let us assume that we are going to determine the correlation between 
initial Cholesterol and Cholesterol measured three years later. Our 
expectation, as rationalized above, is that we will find a positive 
correlation. You must assign an X and a Y variable to compute a 
correlation coefficient. 

This choice for a correlation calculates the correlation coefficient 
between individual X and Y variables. It is a OneXOneY statistic. If 
there is a single X variable and multiple Y variables assigned, a 
result is calculated for each Y variable. If there is a single Y variable 
and multiple X variables assigned, a result is calculated for each X 
variable. If there are multiple X and Y variables assigned, a result is 
calculated for each Xi-Yi pair (matching subscripts). 


*> Assign an X variable, X1, to the Cholesterol column. 
*> Assign a Y variable, Y1, to the Chol-3yrs column. 


*> Choose Correlation from the Compare menu. 
*> Click Coefficient using X-Y pair(s). Click OK. 


*> Choose Table from the View menu. The following table is 
displayed. 


Corr. Coeff. X1: Cholesterol Y1: Chol-3yrs 


Count:   Covariance:   Correlation:   R-squared: 
43       976.878       .719           .517 


Note: 52 cases deleted with missing values. 


As expected, the correlation, .719, is a reasonably large positive 
coefficient. This suggests that there is a substantive positive linear 
correlation between the two measures of Cholesterol. 


The square of this coefficient, .517, suggests that approximately 52% 
of the variation in third year Cholesterol may be predicted given 
initial Cholesterol measurement. It would probably be worthwhile to 
investigate this relationship further with a regression analysis. 
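

The three figures in the summary table are related as sketched below. The sketch assumes the usual sample definitions (covariance and variances divided by n - 1) and drops any row with a missing X or Y, which is how the note about deleted cases is read here. The function name is hypothetical and `None` stands in for a missing value.

```python
import math


def correlation_summary(x, y):
    """Return count, covariance, correlation and R-squared for complete X-Y pairs."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in pairs) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a, _ in pairs) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for _, b in pairs) / (n - 1))
    r = covariance / (sd_x * sd_y)
    return {"count": n, "covariance": covariance,
            "correlation": r, "r_squared": r * r}
```

R-squared is simply the square of the correlation, so a correlation of .719 gives the .517 shown in the table.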


*> Choose Scattergram from the View menu. 


Scattergram for columns: X1Y1 (R-squared = .517), with Cholesterol on 
the horizontal axis and Chol-3yrs on the vertical axis. 


It is apparent that the points of the scattergram rise from the lower 
left to the upper right, thereby suggesting a positive relationship. The 
scattergram is also informative from the perspective of calling our 
attention to the fact that the points become heteroscedastic as the 
associated Cholesterol observed values become larger. There are 
several outliers that might have a distorting effect on the magnitude 
of the correlation coefficient. In particular there is an individual who 
had an observed Cholesterol value of approximately 200 on the 
initial measure and an observed Cholesterol value of approximately 
260 three years later and another with a level of 260 on the initial 
measurement and a level of 280 three years later, both the reverse of 
the general trend. If these two sets of scores are eliminated from the 
dataset the correlation increases in magnitude to .75. However, you 
are well advised to understand the nature of outliers before 
eliminating them from analyses. We will return to these outliers 
when we address the question of regression. 


Correlation Matrix 

This choice calculates a correlation matrix for all assigned X 
variables. The matrix uses only cases that are complete, that is, cases 
with no missing values across all X variables (row-wise deletion of 
missing values). It is a ManyX statistic. One result is calculated using 
all assigned X variables. 


*> Assign an X to the Weight, Cholesterol, Triglycerides, HDL and 
LDL columns. 


*> Choose Correlation from the Compare menu to display the 
Correlation dialog box. 


*> Click Matrix using X Columns. 
*> Click save correlation matrix. Click OK. 


*> Choose Table from the View menu. 


Correlation Matrix for Variables: X1 ... X5 


Weight 
Cholesterol 
Triglycerides 
HDL 
LDL 


This view gives the number of X columns used. The correlation 
matrix provides the names of all variables compared and the 
correlation coefficient values. If there are more than eight variables, 
the matrix occupies additional pages. This view represents an 
extension of the correlation coefficient. It computes all pairwise 
correlations between the selected X variables. 





Note: Some correlations reported in the correlation matrix may not 
appear to agree with the value reported as a single correlation when 
correlation coefficient has been selected. This discrepancy may arise 
from row-wise deletion of incomplete data. 


When a correlation matrix is selected, the correlations are based only 
on the observed values associated with cases that are complete across 
all X variables selected. When a correlation coefficient has been 
selected, the correlation is based only on the observed values 
associated with cases that are complete across the X and Y variable 
selected. Thus, it is possible that, when there are missing data, a 
correlation computed by a correlation matrix may be based on fewer 
cases than it would be if it were computed with a correlation 
coefficient. 
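

The difference between the two deletion rules can be sketched as below. The function names are hypothetical, `None` stands in for a missing value, and the columns are passed as a dictionary of equal-length lists; none of this reflects the program's internal design.

```python
import math


def pearson(x, y):
    """Correlation of two equal-length lists with no missing values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Matrix option: keep only rows complete across ALL columns."""
    names = list(columns)
    rows = [r for r in zip(*columns.values()) if None not in r]
    kept = {name: [r[i] for r in rows] for i, name in enumerate(names)}
    matrix = {a: {b: pearson(kept[a], kept[b]) for b in names} for a in names}
    return matrix, len(rows)


def pairwise_correlation(x, y):
    """Coefficient option: keep rows complete across this X-Y pair only."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return pearson([a for a, _ in pairs], [b for _, b in pairs]), len(pairs)
```

When a third column has a missing value in some row, the matrix drops that row for every pair, so its entry for the first two columns can differ from the coefficient computed for that pair alone.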


There are no graphic views available. 


Saving a Correlation Matrix as a Datafile 


An option in the Correlation dialog box allows the 
correlation matrix to be saved as a StatView II datafile. Bring the 
new window to the front of the screen by selecting it from the 
Windows menu. 








                   Weight   Cholesterol   Triglycerides   HDL 
1  Weight          1.000    -.022         .108            -.276 
2  Cholesterol     -.022    1.000         .401            .352 
3  Triglycerides   .108     .401          1.000           -.278 
4  HDL             -.276    .352          -.278           1.000 
5  LDL             (row illegible apart from Cholesterol: .962) 


The complete square correlation matrix is saved. The first column is 
a string column which contains the variable names. Although the 
data values in the new dataset are displayed to the number of 
decimal places specified for analysis results, the variables are saved 
to 18 decimal places. More places may be displayed using the choice 
Format from the Tools menu. 


Regression 


Three regression models are computed using the Regression choice 
in the Compare menu: 


• simple linear regression 
• multiple regression 
• polynomial regression 


Each analysis computes: 


• R, R-squared, adjusted R-squared, root mean square residual 
• ANOVA table 
• residual statistics 
• beta coefficient table 
• confidence interval tables 


The following values can be computed and added in columns to the 
data window: 


• residuals 
• standardized residuals 
• fitted values 
• predicted values 

*> Choose Regression from the Compare menu, and the 
Regression dialog box is displayed. 


Select Regression: 
   Simple, Multiple, Polynomial 
Select the order of the regression: 
Confidence intervals: 95% and [text entry] 


Add as a new column to the data window: 
   [ ] Residuals 
   [ ] Standardized Residuals 
   [ ] Fitted Values 
   [ ] Predicted Values 
Create new column(s) using values from: 
   included rows, all rows 


Compute residual statistics: No, Yes 

Select which regression model you wish to analyze using the radio 
buttons at the top of the box. If you select a Polynomial regression, 
select the order of the highest power in the regression model using 
the radio buttons directly below the regression type buttons. 


Two confidence intervals can be calculated. The 95% confidence 
interval is automatically calculated. Enter another interval in the text 
entry rectangle following the 95% label, if you wish. 


Selecting the appropriate check boxes saves the following 
information as new columns to the data window: 


Residuals: Computed as the difference between the observed 
Yi and the fitted Y'i. 


Standardized Residuals: Computed by converting the 
residuals to unit normal deviate form, with a mean of 0 and a 
standard deviation of 1. 


Fitted Values: Computed by the regression model as the fitted 
Y'i. 


Predicted Values: If you wish to apply the regression 
equation to additional data not present in the dataset, the 
resulting values are referred to as predicted values. To 
compute predicted values, enter the independent values after 
the last complete case for the model in the X column, that is, 
after the last case where both X and Y values are present. 
These new entries would be the last rows of the dataset. The 
Y values must be left missing. 


Selections made using these check boxes are added as new columns 
to the end of the dataset. If a case has been deleted from the model, 
the new column contains a missing value in that row. The rows 
containing the X-Y pairs used in the regression analysis contain 
missing values in the Predicted Values column. 


You may specify whether the above values are determined for just 
the included rows of the dataset or for all rows of the dataset. 


If you choose Included Rows, the values are calculated for just the 
included rows of the dataset; excluded rows contain missing values. 


If you choose All Rows, the values are calculated for all rows of the 
dataset regardless of their included or excluded state. With this 
option, you can fit a model for one group of data and use this model 
to predict for all groups. The included rows are used to estimate 
model coefficients. The excluded rows are used to compute predicted 
values. 
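

How the four new columns relate to one another can be sketched for the simple regression case. The sketch uses ordinary least squares; the function name is hypothetical, `None` stands in for a missing value, and dividing the residuals by their sample standard deviation to standardize them is an assumption about the exact scaling.

```python
import math


def simple_regression_columns(x, y):
    """Fit Y' = slope * X + intercept on complete rows and build the new columns."""
    complete = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(complete)
    mean_x = sum(a for a, _ in complete) / n
    mean_y = sum(b for _, b in complete) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in complete)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in complete)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    errors = [b - (intercept + slope * a) for a, b in complete]
    # Least-squares residuals have mean 0, so this is their sample standard deviation.
    sd = math.sqrt(sum(e * e for e in errors) / (n - 1))

    cols = {"slope": slope, "intercept": intercept, "fitted": [],
            "residuals": [], "standardized": [], "predicted": []}
    for a, b in zip(x, y):
        if a is not None and b is not None:
            fit = intercept + slope * a
            cols["fitted"].append(fit)
            cols["residuals"].append(b - fit)
            cols["standardized"].append((b - fit) / sd if sd else 0.0)
            cols["predicted"].append(None)  # rows used in the model stay missing
        else:
            cols["fitted"].append(None)
            cols["residuals"].append(None)
            cols["standardized"].append(None)
            # An X value entered with Y left missing receives a predicted value.
            cols["predicted"].append(None if a is None else intercept + slope * a)
    return cols
```

Rows used to fit the model receive fitted values and residuals, while rows with an X value but no Y value receive only a predicted value.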


You may choose to compute residual statistics. StatView II 
automatically selects this if you choose to add any new columns to 
the data window in the dialog box. 


Simple Regression 


This choice calculates a simple regression between a dependent Y 
variable and an independent X variable. It is a OneXOneY statistic. 
If there is a single X variable and n Y variables assigned, n models 
are computed. If there is a single Y variable and n X variables 
assigned, n models are computed. If there are multiple X and Y 
variables assigned, a result is calculated for each Xi-Yi pair 
(matching subscripts). 


Previously, with correlation and the Lipid dataset, we looked at the 
correlation between initial Cholesterol count and Cholesterol count 
three years after the subjects received instruction on reducing 
cholesterol through dieting. The resulting correlation coefficient was 
quite large, r = .719. If we wished to predict possible Cholesterol 
reduction over a three year period through dieting, could we predict 
the future Cholesterol value for an individual given the individual's 
current Cholesterol count? Note that we are not predicting 
Cholesterol reduction. This is a regression question. 


The regression model assumes two sets of individuals from the same 
population: those from whom the regression statistics are derived 
and those to whom the regression results are generalized. A 
regression equation is derived to predict the dependent variable from 
the independent variable and then summary statistics are computed 
to determine how well the regression model will work. 

*> Assign a Y to Chol-3yrs. 

*> Assign an X to Cholesterol. 


*> Choose Regression from the Compare menu to display the 
Regression dialog box. 


*> Click Simple. 


*> Click Predicted Values. 
*> Check the Compute residual statistics box. 
*> Click OK. 


In order to compute predicted values, we must enter new data at the 
end of the X variable column, Cholesterol. 


*> Enter the data points 160 and 260 at the bottom of the Cholesterol 
column. This will create two new rows in the dataset. These will be 
the Cholesterol scores for which predicted Chol-3yrs scores will be 
obtained. 


*> Choose Table from the View menu. 


Simple Regression X1: Cholesterol Y1: Chol-3yrs 


Count:   R:   R-squared:   Adj. R-squared:   RMS Residual: 


Analysis of Variance Table 
Source        DF:   Sum Squares:   Mean Square:   F-test: 
REGRESSION    1     28782.079      28782.079      43.956 
RESIDUAL      41                   654.797        p = .0001 
TOTAL         42    55628.744 


Residual Information Table 
SS[e(i)-e(i-1)]:   e > 0:   e <= 0:   DW test: 
42372.135                             1.578 


Note: 54 cases deleted with missing values. 








The count represents the number of paired observations used in 
computing the correlation. R is the correlation between Cholesterol 
and Chol-3yrs. The value for R-squared is the square of the correlation 
coefficient, and is interpreted as the proportion of variance of the 
dependent variable that is predictable from the independent variable. 
Thus, you can also compute R-squared from the ANOVA table as the 
ratio of the regression sum of squares to the total sum of squares. The 
adjusted R-squared is the squared correlation adjusted for sample size 
and the number of coefficients in the regression equation. It is a less 
biased estimate of the population squared correlation coefficient. The 
root mean square residual is just the square root of the residual mean 
square in the ANOVA table. It represents the standard deviation of the 
residuals, which are the errors of prediction. 


The ANOVA table represents a partition of the total sum of squares 
into a predictable (regression) sum of squares and an unpredictable 
(residual) sum of squares. The regression mean square is the variance 
of the fitted values while the residual mean square is the variance of 
the residual values. The F-ratio, listed under F-test, is formed as the 
ratio of the regression mean square to the residual mean square. The 
p value under F-test represents the probability that an F-ratio of 
43.956 or larger, with 1 and 41 degrees of freedom, would occur by 
chance sampling fluctuation. If you are operating at the .01 level of 
significance, you would conclude that these are significant results, 
that is, that the amount of predictable variance, the regression mean 
square, is significantly greater than 0. 
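

The relationships among these quantities can be checked with a short sketch. It starts from the regression sum of squares (28782.079) and the residual mean square (654.797) shown in the ANOVA table, with 43 cases and one predictor. The function name is hypothetical and the adjusted R-squared uses the common formula 1 - (1 - R-squared)(n - 1)/(n - k - 1), which is assumed rather than taken from the manual.

```python
import math


def regression_summary(ss_regression, ms_residual, n, k=1):
    """Derive R-squared, adjusted R-squared, RMS residual and F from ANOVA entries."""
    df_regression = k
    df_residual = n - k - 1
    ss_residual = ms_residual * df_residual
    ss_total = ss_regression + ss_residual
    r_squared = ss_regression / ss_total
    adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual
    return {
        "df": (df_regression, df_residual, n - 1),
        "r_squared": r_squared,
        "adj_r_squared": adj_r_squared,
        "rms_residual": math.sqrt(ms_residual),
        "f_ratio": (ss_regression / df_regression) / ms_residual,
    }


# Values read from the ANOVA table for Cholesterol and Chol-3yrs.
summary = regression_summary(28782.079, 654.797, n=43)
```

With these inputs the F-ratio reproduces 43.956 and R-squared reproduces .517 to the precision shown in the tables.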
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The regression model assumes that the residuals are uncorrelated. In 
the situation where the data are from an ordered sequence, it is 
assumed that the residuals are not dependent upon their neighbors. 


It is very important to note that our Cholesterol data are not
appropriate for the Durbin-Watson test. However, to facilitate an
understanding of this test, let us assume that the order of the paired
observations is fixed and meaningful. The Durbin-Watson test is used
to test the assumption that the residuals occur independently of the
order of the paired observations. The Durbin-Watson statistic, 1.578,
is the sum of squares of the differences between successive residuals,
divided by the residual sum of squares. To interpret the
Durbin-Watson statistic it is necessary to consult a Durbin-Watson
table. If the statistic is less than the range in the table, it is
assumed that the residuals are not independent and that they are
positively correlated. If the statistic is greater than the range in
the table, it is assumed that the residuals are independent. If the
statistic falls within the range in the table, the test is
inconclusive.


Using the table provided by Neter and Wasserman, we find that for a
sample of 43 the range in the table for the statistic is 1.48 to 1.57
for a .05 level of significance. Thus, if we consider our value of
1.578 as exceeding 1.57, we conclude that there is no evidence that
the residuals are correlated. However, remember that this conclusion
is meaningless within the context of our Cholesterol data since the
order of the paired observations is arbitrary.
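
The definition of the statistic given above (sum of squared
differences between successive residuals, divided by the residual sum
of squares) can be sketched in a few lines of Python. The function
name and the two residual sequences are illustrative only; they are
not taken from the Cholesterol data.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the residual sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual sequences.
alternating = [1.0, -1.0, 1.0, -1.0]   # each residual opposes its neighbor
persistent = [1.0, 1.0, 1.0, 1.0]      # each residual repeats its neighbor

print(durbin_watson(alternating))  # large value: neighbors differ sharply
print(durbin_watson(persistent))   # zero: neighbors are identical
```

The statistic always lies between 0 and 4, and values near 2 are what
uncorrelated successive residuals tend to produce.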


Also in the residual information table, we have a count of the 
frequency of positive residuals and the frequency of negative and 0 
residuals. These two frequencies should be approximately equal if 
the residuals are independent. If we find widely discrepant 
frequencies, we can conclude that a linear regression model may not 
be the most appropriate regression model for the data. 


*> Click the scroll bar. You see: 


Simple Regression X1: Cholesterol Y1: Chol-3yrs 


Beta Coefficient Table 
Variable:    Coefficient:  Std. Err.:  Std. Coeff.:  t-Value:  Probability: 
INTERCEPT    47.328 
SLOPE        .702 


Confidence Intervals Table 
Variable:    95% Lower:   95% Upper:   90% Lower:  90% Upper: 
MEAN (X,Y)   [illegible]  [illegible]  174.944     188.075 
SLOPE        [illegible]  [illegible]  [illegible] [illegible] 

[Remaining table entries are not legible in this copy.] 


Predicted: Column 26 


The beta coefficient table provides the information required for the 
linear regression equation. Assume Y'i to be the ith fitted value for 
Chol-3yrs. Assume Xi to be the initial Cholesterol count for the ith 
person. Then the general regression equation is: Y'i = .702Xi + 
47.328. These coefficients are taken directly from the beta table. The 
beta table also has a two-tailed test to determine if the slope is 
significantly different from 0. For simple regression, this test 
provides information that has already been determined in the 
ANOVA table. Notice that when the F-ratio has one numerator degree 
of freedom, it is the square of the corresponding t-ratio. 


The confidence intervals table provides the 95% and 90% confidence 
bands for the slope and for the mean of the dependent variable, Y. 
The second confidence interval value, in this case 90%, is 
user-entered through the Regression dialog box. The interpretation of 
these confidence intervals is the same as the interpretation discussed 
in the Confidence Interval section of Chapter 4. 


If you bring the dataset window to the front of the screen and scroll 
to the right you will see that Column 26 has been added to the data 
window. Its last two rows contain the predicted values, 159.569 and 
229.720, that correspond to the values 160 and 260 entered at the 
bottom of the Cholesterol column. 
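
The predicted values can be approximated by hand from the printed
regression equation. The Python sketch below does this with the
coefficients as they are rounded in the table (.702 and 47.328); the
function name is ours. Because the program presumably works with
unrounded coefficients, the hand-computed values differ slightly from
the 159.569 and 229.720 stored in Column 26.

```python
SLOPE = 0.702       # slope as printed in the beta coefficient table
INTERCEPT = 47.328  # intercept as printed in the beta coefficient table


def predict_chol_3yrs(cholesterol):
    """Predicted Chol-3yrs for a given initial Cholesterol count."""
    return SLOPE * cholesterol + INTERCEPT


for x in (160, 260):
    print(x, round(predict_chol_3yrs(x), 3))
```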


*> Choose Scattergram from the View menu. 


[Scattergram: Chol-3yrs (Y) plotted against Cholesterol (X) with the 
fitted line y = .702x + 47.328, r2 = .517] 


A scatterplot of the points along with the fitted regression line for 
predicting Y1 given X1 is displayed. Notice in this plot that the 
regression line appears to pass beneath most of the points associated 
with initial Cholesterol counts greater than approximately 230. This 
suggests that the residuals are not independent of the observed 
values and that there will be positive residuals associated with 
these points. 


This scattergram contains a tool specific to the Simple Regression, 
namely the confidence bands control. 




Clicking this control presents the following dialog box. 
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Select confidence information for 6 Simple Regression 


[ ] 95% confidence limits for slope of regression line 
[ ] 95% confidence bands for the true mean of Y 


[ ] 90% confidence limits for slope of regression line 
[ ] 90% confidence bands for the true mean of Y 





For both confidence intervals, 90% and 95%, you can plot: 
* the confidence bands for the slope of the regression line 
* the confidence band for the mean of the dependent variable 


*> Click both choices for the 95% confidence bands, and click 
OK. 


[Scattergram: Chol-3yrs (Y) plotted against Cholesterol (X) with the 
fitted line y = .702x + 47.328, r2 = .517, and both 95% confidence 
bands] 


The confidence bands for the slope of the regression line appear to 
pass through the regression line, regardless of the level of 
confidence. Actually, the confidence bands touch the regression line 
at the point of intersection of the X and Y means on the regression 
line. These confidence bands provide some sense of the stability of 
the slope of the regression line. These bands provide a range within 
which the true regression line may fall. 


The confidence bands for the mean of the dependent variable, Y 
given X, follow a similar pattern. The further removed a point on the 
regression line is from the point of intersection of the X and Y 
means, the less stable is the predicted Y for the associated X. It is 
clear for our Cholesterol data, as well as for most other data, that 
the consequences of sample instability are most severe at the extreme 
ends of the X distribution. 


Of course, if you had selected the 90% confidence band, the bands 
would be narrower than for the 95% confidence band. 
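
The way the band for the mean of Y widens away from the X mean can be
illustrated with the usual textbook formula for its half-width,
t * s * sqrt(1/n + (x0 - mean X)^2 / Sxx). The Python sketch below
assumes that formula; the function name, the X values, the RMS
residual and the critical t are all hypothetical and are not taken
from the Lipid Data or from StatView II itself.

```python
import math


def mean_band_half_width(x0, xs, rms_residual, t_critical):
    """Half-width of the confidence band for the mean of Y at X = x0."""
    n = len(xs)
    x_mean = sum(xs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)  # spread of the X values
    return t_critical * rms_residual * math.sqrt(1.0 / n + (x0 - x_mean) ** 2 / sxx)


# Hypothetical inputs for illustration only.
xs = [150, 180, 200, 220, 250]
rms_residual = 20.0
t_critical = 3.182  # two-tailed .05 critical t for 3 degrees of freedom

at_mean = mean_band_half_width(200, xs, rms_residual, t_critical)
at_edge = mean_band_half_width(250, xs, rms_residual, t_critical)
print(round(at_mean, 2), round(at_edge, 2))
```

The band is narrowest at the X mean and grows as x0 moves toward
either end of the X distribution.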


*> Click the confidence bands tool once more and remove the 
confidence bands. 


You can plot more than one regression model on a single graph. If 
you were to assign a second Y variable, Y2, or a second X variable, 
X2, the scattergram will display the results for this new variable. 


Recall from the demonstration of the correlation matrix that LDL, 
Low Density Cholesterol, had a high correlation with the initial 
measure of Cholesterol. It should also have a high correlation with 
Cholesterol measured three years later, especially if the initial 
Cholesterol is highly related to Cholesterol measured three years 
later, Chol-3yrs. However, assuming that the degree of linear 
relationship is essentially the same, you may wonder whether the 
slopes of the two regression lines are the same or whether the nature 
of the relationship has changed. 


*> Assign an X to LDL. 


The second regression model, using LDL as an independent variable, 
is calculated. The scattergram now displays the results for the 
dependent Y, Chol-3yrs, against the second X variable, LDL. 


*> Click the paging/composite tool. 


The scattergram now displays the results of the second analysis 
within the context of the first regression scattergram. Notice that two 
sets of points are plotted and represented by different shapes. 
Clearly, the two regression lines are for all practical purposes 
parallel, thereby implying that the slopes of the two lines are not 
different and that the nature of the relationship has not changed. 


[Composite scattergram for columns X1Y1 .. X2Y1: Chol-3yrs (Y) 
plotted against Cholesterol and LDL, each with its own plotting 
symbol] 


Care must be taken when using this option. If you select two 
variables for X that are measured on very different scales, such as 
HDL and Cholesterol, the resulting scatterplot can take on a rather 
deceptive form that does not represent either scattergram 
adequately. 


Multiple Regression 


The Multiple Regression selection from the regression dialog box 
calculates a multiple regression between a dependent Y variable and 
one or more independent X variables. Variables are entered into the 
equation in the order of their subscripts, X1 to Xp. It is a 
ManyXOneY statistic. A result is calculated for each Y variable 
assigned. 


Suppose that we use our other blood variables, Triglycerides, HDL, 
and LDL, in an effort to predict Chol-3yrs. These are reasonable 
substitutes for initial Cholesterol level. However, rather than using a 
single variable with a single regression equation we will use three 
variables with a single regression equation. The rationale for using 
several independent variables is that each may have a unique effect 
upon the dependent variable. Such a model assumes that the 
variables predict more collectively than each could predict 
independently of the other two. 


*> Assign Y to Chol-3yrs. 

*> Assign an X to Triglycerides. This will be X1. 
*> Assign an X to HDL. This will be X2. 

*> Assign an X to LDL. This will be X3. 


*> Choose Regression from the Compare menu to display the 
Regression dialog box. 


*> Click Multiple. 


*> Save the following three values to the data window by clicking 
Residuals, Standardized Residuals, and Fitted Values. 


*> Click OK. 


*> Choose Table from the View menu. 


Multiple Regression Y1: Chol-3yrs 3 X variables 


Analysis of Variance Table 
Source       DF:  Sum Squares:  Mean Square:  F-test: 
REGRESSION   3    [illegible]   9638.862      14.073 
RESIDUAL     39   [illegible]   684.927       p = .0001 
TOTAL        [illegible] 


Residual Information Table 
[Table entries are not legible in this copy.] 


Note: 52 cases deleted with missing values. 
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All of this information is interpreted in a fashion identical to the 
interpretation discussed above in the simple regression section, with 
the exception that the ANOVA table F-ratio is no longer the square 
of a t-value from the beta coefficient table. 


This multiple regression model accounts for approximately 52% of 
the variance associated with Chol-3yrs. The multiple regression 
model accounts for a statistically significant portion of the variance 
(F=14.073; DF=3,39; p<.0001). That is, the proportion of variance 
accounted for, 52%, is more than you would expect by chance 
alone. 


*> Click the down arrow on the scroll bar. 


Multiple Regression Y1: Chol-3yrs 3 X variables 


Beta Coefficient Table 
Variable:       Coefficient:  Std. Err.:  Std. Coeff.:  t-Value:  Probability: 
INTERCEPT       50.252 
Triglycerides   .028 
HDL             .618 
LDL             .694 

[Remaining table entries are not legible in this copy.] 


Residual: Column 26   Std. Residual: Column 27   Fitted: Column 23 


If there are more than 7 coefficients (7 independent variables), the 
table is continued on successive pages. Assume Y'i to be the ith 
fitted value for Chol-3yrs from the 3 variables. Assume X1i to be 
the Triglyceride value for the ith person. Furthermore, assume X2i to 
be the HDL value for the ith person and X3i to be the LDL value for 
the ith person. Then the general regression equation to determine Y'i 
is: 


Y'i = .028X1i + .618X2i + .694X3i + 50.252 
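
To see how the equation is applied to one person, the Python sketch
below evaluates it for a single case. The coefficients are those
printed in the equation above; the function name and the example case
are hypothetical and do not come from the Lipid Data.

```python
# Coefficients as printed in the multiple regression equation.
COEFFICIENTS = {"Triglycerides": 0.028, "HDL": 0.618, "LDL": 0.694}
INTERCEPT = 50.252


def fitted_chol_3yrs(case):
    """Fitted Chol-3yrs for one case given as a name -> value mapping."""
    return INTERCEPT + sum(COEFFICIENTS[name] * case[name] for name in COEFFICIENTS)


# Hypothetical person, for illustration only.
example = {"Triglycerides": 100, "HDL": 50, "LDL": 120}
print(round(fitted_chol_3yrs(example), 3))
```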


The entries of the beta coefficient table may be interpreted in a 
fashion similar to the interpretation applied to the beta coefficient 
table associated with simple regression. The t-value and associated 
probability is a two-tailed test to determine whether or not the 
associated regression coefficient is significantly different from 0. It 
would appear from this table that only the LDL coefficient is 
significant, p<.0001. The standard coefficient is the standardized 
regression coefficient for use with standard scores, as opposed to the 
coefficients which are used with the observed scores. 


The note on the bottom of the page names the columns which 
contain the computed values selected from the regression dialog box. 
If any values could not be added to the dataset because of memory 
constraints, a message would also be found here. 


*> Click the down arrow on the scroll bar. 
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Multiple Regression Y4:Chol-3yrs 3 X variables 


Confidence Intervals and Partial F Table 
Variable:       95% Lower:  95% Upper:  90% Lower:  90% Upper:  Partial F: 
Triglycerides   -.103 
HDL 
LDL             .398        .989 

[Remaining table entries are not legible in this copy.] 









There is an additional statistic associated with the multiple 
regression model that is not associated with the simple regression 
model: the partial F-test. Not surprisingly, the partial F value is the 
square of the t-value associated with the coefficient in the beta 
coefficient table. It has (1, residual DF) degrees of freedom. If 
significant, this partial F-value identifies a variable that is 
statistically significant given that all other independent variables 
have been included in the model. The partial F addresses the 
question of whether the addition of this specific independent 
variable, given the other variables in the regression model, 
significantly contributes to the predictable variance. When using 
StatView II, it is convenient to check the significance associated 
with the t-value of the beta coefficient table to determine whether or 
not the partial F is significant. Only the LDL variable makes a 
significant contribution to the multiple regression equation. 


Polynomial Regression 


This section in the regression dialog box calculates a polynomial 
regression between a dependent Y and a polynomial independent X 
variable. Specify an order (degree) of the polynomial between 2 and 
9. 


It is a OneXOneY statistic. If there is a single X and multiple Y 
variables assigned, a result is calculated for each Y variable. If there 
is a single Y and multiple X variables assigned, a result is calculated 
for each X variable. If there are multiple X and Y variables assigned, 
a result is calculated for each Xi-Yi pair (matching subscripts). 


Simple regression dealt with fitting straight lines. There are datasets 
for which straight-line regressions may be “supplemented” by 
considering more complex lines. One such model for fitting complex 
lines to a single independent and dependent dataset is the polynomial 
regression model sometimes referred to as curvilinear regression. A 
polynomial regression line is a curved regression line. 


The first-order regression line predicts Y from X as a straight line. A 
second-degree polynomial predicts Y from X and X^2 and is a curve 
with a single bend. A third-degree polynomial predicts Y from X, 
X^2, and X^3 and is a curve with up to two bends. StatView II allows 
you to use polynomial models up to ninth-order polynomials. 


A major problem for polynomial regression is one of determining 
what order to use. Theoretically, the highest order regression for any 
set of data would be k-1, where k represents the number of distinct 
values for the independent variable. However, parsimony compels us 
to find the lowest possible order required to describe the data. As the 
order of a polynomial model increases, the proportion of predictable 
variance can never decrease, but you must determine whether or 
not the increase is a useful increase. 


*> Assign a Y to Triglycerides. 
*> Assign an X to HDL. 


*> Choose Regression from the Compare menu to display the 
Regression dialog box. 


*> Click Polynomial Regression. 
*> Click an order of 4. 
*> Click OK. 


Note that we did not request that residual statistics, the Durbin- 
Watson test, be computed. Because the observations that are being 
used as an illustrative example are not in a sequential order, this 
residual test is inappropriate. 


The ideal strategy to employ when using polynomial regression is 
to start with a high order polynomial equation and look at the 
scattergram. If there are no outliers, then look at the tables. 
Assuming that all outliers have been eliminated from the dataset, you 
then systematically reduce the polynomial order until the 
regression sum of squares becomes significant and the probability 
for the highest order coefficient becomes significant. Some 
researchers will investigate even lower order models. How high is 
high? A good rule of thumb, but nonetheless quite arbitrary, is 
order four unless you believe a higher order would be more 
appropriate. 


*> Choose Scattergram from the View menu. 


[Scattergram: Triglycerides (Y) plotted against HDL (X) with the 
fitted line y = 4494.9 - 360.629x + 10.992x^2 - .148x^3 + .001x^4] 
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This view of the scattergram clearly shows several outliers. In 
particular, one Triglycerides value is 480 while the typical value is 
between 75 and 100. Most likely those values beyond 220 just don't 
fit with the other cluster of points and should not be included in the 
analysis. Polynomial regression is extremely sensitive to outliers. 


*> Click on the column heading for Triglycerides. 

*> Choose Range from the Tools menu. This will allow you to 
specify the range of values for Triglycerides to be included in the 
analysis. Enter 220 for the upper bound. 


*> Click Done. Now only Triglyceride values less than 220 will be 
included in the analysis. 


*> Choose Table from the View menu. 


Polynomial Regression X1: HDL Y1: Triglycerides 


Count:  R:  R-squared:  Adj. R-squared:  RMS Residual: 
[Values are not legible in this copy.] 


Analysis of Variance Table 
Source       DF:  Sum Squares:  Mean Square:  F-test: 
REGRESSION   4    5629.796      1407.449      1.47 
RESIDUAL     86   82327.105     957.292       p = .2183 
TOTAL        [illegible] 


No Residual Statistics Computed 


If there are missing values a note will be placed on the bottom of the 
page indicating the number of cases deleted because of missing 
values. 


The initial summary table is identical to the summary table used with 
multiple regression. It is interpreted in an identical fashion to the 
multiple regression summary table. The fourth order polynomial 
regression equation does not predict a statistically significant 
proportion of the variance of the dependent variable since F(4,86)= 
1.47 and p=.2183. There is no need to continue looking at the other 
tables. We should reduce the polynomial order to 3 and recompute 
the tables. 


*> Choose Regression from the Compare menu. 
*> Click an order of 3. 
*> Click OK. 


StatView II automatically recomputes the polynomial regression 
with the new order. 
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Polynomial Regression X1{: HDL Yj 1: Triglycerides 


Count:  R-squared:  Adj. R-squared:  RMS Residual: 
[Values are not legible in this copy.] 


Analysis of Variance Table 
Source       DF:  Sum Squares:  Mean Square:  F-test: 
REGRESSION   3    5179.033      1726.344      1.814 
RESIDUAL     87   82777.868     951.47        p = .1505 
TOTAL        90   87956.90 


No Residual Statistics Computed 


The third order polynomial regression equation does not predict a 
statistically significant proportion of the variance of the dependent 
variable, since F(3,87)= 1.814 and p=.1505. There is no need to 
continue looking at the other tables. We should reduce the 
polynomial order to 2 and recompute the tables. 


*> Choose Regression from the Compare menu. 
*> Click an order of 2. 
*> Click OK. 


Once again, StatView II automatically recomputes the polynomial 
regression with the new order. 


Polynomial Regression X1: HDL Y1: Triglycerides 


Count:  R-squared:  Adj. R-squared:  RMS Residual: 
[Values are not legible in this copy.] 


Analysis of Variance Table 
Source       DF:  Sum Squares:  Mean Square:  F-test: 
REGRESSION   2    5174.252      2587.126      2.75 
RESIDUAL     88   82782.649     940.712       p = .0694 
TOTAL        90   87956.90 


No Residual Statistics Computed 


Again we see a lack of fit. The second order polynomial regression 
equation does not predict a statistically significant proportion of the 
variance of the dependent variable since F(2,88)= 2.75 and p=.0694. 
There is no need to continue looking at the other tables. The 
curvilinear model, polynomial regression, does not fit the data. The 
only model left is the simple regression model, order one. 


*> Choose Regression from the Compare menu. 
*> Click Simple Regression. 
*> Click OK. 


The table view contains: 
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Simple Regression ®;1: HDL ‘Y1: Triglycerides 


Count:  R:  R-squared:  Adj. R-squared:  RMS Residual: 
[Values are not legible in this copy.] 


Analysis of Variance Table 
Source       DF:  Sum Squares:  Mean Square:  F-test: 
REGRESSION   1    5173.846      5173.846      5.562 
RESIDUAL     89   [illegible]   930.147       p = .0205 
TOTAL        90   [illegible] 


No Residual Statistics Computed 


The first order, linear, regression equation does predict a statistically 
significant proportion of the variance of the dependent variable, 
assuming a critical probability level of .05, since F(1,89)= 5.562 and 
p=.0205. If the additional table is reviewed it will be apparent that 
the regression coefficient is significant. 
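
Each F-ratio reported during the order reduction is simply the
regression mean square divided by the residual mean square. The
Python sketch below repeats that arithmetic for the four ANOVA tables
using the mean squares printed in them; the dictionary layout and the
function name are ours.

```python
# order -> (regression mean square, residual mean square), as printed
# in the four Analysis of Variance tables for HDL and Triglycerides.
MEAN_SQUARES = {
    4: (1407.449, 957.292),
    3: (1726.344, 951.47),
    2: (2587.126, 940.712),
    1: (5173.846, 930.147),
}


def f_ratio(ms_regression, ms_residual):
    """F-ratio as the ratio of the two mean squares."""
    return ms_regression / ms_residual


for order, (ms_reg, ms_res) in MEAN_SQUARES.items():
    print(order, round(f_ratio(ms_reg, ms_res), 3))
```

The residual mean square barely changes from one order to the next,
so the F-ratio grows mainly because the same regression sum of
squares is spread over fewer degrees of freedom.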


*> Click the down arrow on the scroll bar. 


Simple Regression X1: HDL Y1: Triglycerides 


Beta Coefficient Table 
Variable:    Coefficient:  Std. Err.:  Std. Coeff.:  t-Value:  Probability: 
INTERCEPT    123.46 
SLOPE        -.799         .339        -.243 


Confidence Intervals Table 
Variable:    95% Lower:  95% Upper:  90% Lower:  90% Upper: 
MEAN (X,Y)   80.614      93.32       81.652      92.282 
SLOPE        -1.472 

[Remaining table entries are not legible in this copy.] 





As expected, the regression coefficient is significant. When 
describing the Triglycerides given HDL, the linear regression model 
is superior to the polynomial regression model. Taking the 
coefficients from the beta coefficient table, the linear regression 
equation for the ith fitted value is: Y'i = -.799Xi + 123.46. 
Unfortunately, we do not have a single set of variables in our 
illustrative datasets for which polynomial regression provides an 
adequate fit. However, for convenience of discussion let us assume 
that the third order model fit our data. 
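
Returning briefly to the linear fit, the entries of its beta
coefficient table can be cross-checked against the ANOVA table. The
Python sketch below uses the printed slope (-.799) and standard error
(.339); the critical t of 1.987 for 89 degrees of freedom is an
assumed value read from a t table, and the variable names are ours.

```python
SLOPE = -0.799      # slope from the beta coefficient table
STD_ERR = 0.339     # its standard error from the same table
T_CRITICAL = 1.987  # assumed two-tailed .05 critical t for 89 DF

t_value = SLOPE / STD_ERR
f_from_t = t_value ** 2                    # should be close to the ANOVA F-ratio
lower_95 = SLOPE - T_CRITICAL * STD_ERR    # lower 95% limit for the slope
upper_95 = SLOPE + T_CRITICAL * STD_ERR    # upper 95% limit for the slope

print(round(t_value, 3), round(f_from_t, 3))
print(round(lower_95, 3), round(upper_95, 3))
```

The squared t-value lands close to the F-ratio of 5.562; the small
gap comes from working with coefficients rounded to three decimals.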

*> Choose Regression from the Compare menu. 

*> Click Polynomial Regression. 

*> Click an order of 3. 

*> Click OK. 


The first table has already been reported and discussed. Let us 
assume that the F-ratio is significant. 


*> Click the down arrow on the scroll bar. 
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Polynomial Regression X{: HDL Yj: Triglycerides 


Beta Coefficient Table 
Variable:  Coefficient:  Std. Err.:  Std. Coeff.:  t-Value:  Probability: 
[Table entries are not legible in this copy.] 





If the order of the polynomial is higher than 7, the table is continued 
on the next page. If you chose to predict values, there will be a note 
on the bottom of the page that gives the name of the column which 
contains the predicted values. 


Following the prescribed strategy, we would expect all three 
regression coefficients (X, X^2, X^3) to be significantly different 
from 0, p<.05. This information would be determined from the 
probability column. As in the multiple regression table, the square of 
the t-ratio is the partial F-ratio. Thus, we know that a statistically 
significant regression weight will be associated with a statistically 
significant partial F-ratio. From the above table the third order 
polynomial regression equation for fitting the ith Triglyceride value 
from the ith HDL value is: 


Y'i = 142.506 - 2.022Xi + .025Xi^2 - .0001691Xi^3 
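
A polynomial equation of this kind is evaluated term by term. The
Python sketch below does so with Horner's rule, taking the cubic
coefficient as 1.691E-4, the value shown in the scattergram title for
this model. The function name and the HDL value of 50 are
illustrative only.

```python
# Coefficients in ascending order of power: constant, X, X^2, X^3.
COEFFICIENTS = [142.506, -2.022, 0.025, -1.691e-4]


def evaluate_polynomial(coefficients, x):
    """Evaluate a polynomial with Horner's rule (highest power first)."""
    result = 0.0
    for coefficient in reversed(coefficients):
        result = result * x + coefficient
    return result


# Fitted Triglycerides for a hypothetical HDL value of 50.
print(round(evaluate_polynomial(COEFFICIENTS, 50), 4))
```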


*> Click the down arrow on the scroll bar. 


Polynomial Regression X1: HDL Y1: Triglycerides 


Confidence Intervals and Partial F Table 
Variable:  95% Lower:  95% Upper:  90% Lower:  90% Upper:  Partial F: 
[Table entries are not legible in this copy.] 


This table provides the confidence intervals for the regression 
weights reported in the previous table. It also reports the partial F- 
ratios. Those polynomials associated with a significant partial F-ratio 
are making a statistically significant contribution to the predictable 
variance. Most of the information in this table is derived from the 
previous table. 












*> Choose Scattergram from the View menu. 


[Scattergram: Triglycerides (Y) plotted against HDL (X) with the 
fitted line y = 142.506 - 2.022x + .025x^2 - 1.691E-4x^3] 


This view of the scattergram should clearly show the curvilinear 
nature of the third order polynomial regression line. Because of the 
actual poor fit of the polynomial regression model to the data and 
because the linear model is most appropriate, the third order 
polynomial regression line appears to be almost linear. 


We have provided a strategy for interpreting a polynomial regression 
model. We feel that this is the most effective strategy for 
interpretation. There are other strategies, but we have found them to 
be dangerously misleading with our illustrative datasets. We have 
also discovered that we have no data that adequately fit a polynomial 
model. Such data are difficult to find. Data that most usually 
conform to the polynomial model are time series data. 


Stepwise Regression 


When you consider the multiple regression model, it is important to 
keep in mind that a frugal solution is frequently desirable. We want 
the most efficient regression equation with the smallest number of 
variables. We want to be sure that every variable in the multiple 
regression equation makes a statistically significant contribution to 
the predictable variance. Most importantly, we want to predict as 
much of the variance of the dependent variable as is possible from 
the composite of independent variables. This whole process is 
complicated by the fact that the independent variables may be 
correlated with each other and consequently each predict a “same” 
part of the variation in the dependent variable. 


StatView II computes a multiple linear regression using forward 
stepwise regression with elimination of unnecessary variables. The 
forward selection procedure selects as the next variable for the 
regression model that independent variable with the highest partial 
correlation with the dependent variable. Essentially, the partial 
F-ratio associated with each remaining variable is computed based 
upon the inclusion of that variable in the existing equation. Of 
those variables not included in the regression equation, the 
variable with the largest partial F-ratio is selected for inclusion and 
then new partial F-ratios are computed. 


The stepwise procedure used by StatView II improves this forward 
selection by including an additional evaluative step in the procedure. 
With the inclusion of each new variable in the model, all variables 
previously entered are reevaluated. This reevaluation will remove a 
variable from the model if the variable's variance contributions are 
accounted for by variables subsequently entered into the model, that 
is, if a variable’s partial F-ratio becomes less than some 
predetermined value it is removed from the model. 


The StatView II stepwise procedure continues until no variables 
currently in the equation can be removed and the variable with the 
highest partial correlation not in the equation fails the F-to-Enter test. 
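
The procedure described above can be sketched in plain Python. This
is a minimal illustration under stated assumptions, not StatView II's
actual implementation: the function names are ours, the least squares
fit is done by solving the normal equations directly, and a
variable's partial F-ratio is taken to be the drop in the residual
sum of squares it produces divided by the residual mean square of the
larger model. Forced variables are not handled.

```python
def residual_ss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[j] * r[k] for r in rows) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))] for j in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [v - factor * w for v, w in zip(a[r], a[c])]
    beta = [a[j][p] / a[j][j] for j in range(p)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def stepwise(xs, y, f_enter=4.0, f_remove=3.996):
    """Return the names of the variables kept by a forward stepwise search."""
    n, model = len(y), []
    for _ in range(2 * len(xs)):  # simple guard against cycling
        rss_now = residual_ss([xs[v] for v in model], y)
        best, best_f = None, 0.0
        for name in xs:  # forward step: largest F-to-enter among the rest
            if name in model:
                continue
            rss_new = residual_ss([xs[v] for v in model] + [xs[name]], y)
            f = (rss_now - rss_new) / (rss_new / (n - len(model) - 2))
            if f > best_f:
                best, best_f = name, f
        if best is None or best_f < f_enter:
            break  # nothing left passes the F-to-Enter test
        model.append(best)
        while len(model) > 1:  # re-evaluate variables entered earlier
            rss_full = residual_ss([xs[v] for v in model], y)
            ms_res = rss_full / (n - len(model) - 1)
            partial = {v: (residual_ss([xs[w] for w in model if w != v], y)
                           - rss_full) / ms_res for v in model}
            weakest = min(partial, key=partial.get)
            if partial[weakest] >= f_remove:
                break
            model.remove(weakest)
    return model


# Hypothetical data: y depends on x1 only; x2 is an unrelated pattern.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
noise = [0.1, -0.2, 0.15, 0.05, -0.1, 0.2, -0.15, -0.05]
y = [3 * a + e for a, e in zip(x1, noise)]
print(stepwise({"x1": x1, "x2": x2}, y))
```

With these made-up data only x1 passes the F-to-Enter test; x2 adds
almost nothing once x1 is in the equation, so the search stops.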


At each step, the results you have seen in the regression examples 
above are displayed. The following values can be computed and 
added to the data window: 


* residuals 
* standardized residuals 
* fitted values 
* predicted values 


*> Choose Stepwise Regression from the Compare menu. The 
following dialog box is displayed: 


Stepwise Regression Parameters: 


F-to-Enter [4]    F-to-Remove [3.996] 


Force variables into the regression?: 
(*) No   ( ) Yes, force X variables [  ] to [  ] 


Add as a new column to the data window: 
[ ] Residuals 
[ ] Standardized Residuals 
[ ] Fitted Values 
[ ] Predicted Values 


Create new column(s) using values from: 
(*) included rows   ( ) all rows 


Compute residual statistics: (*) No   ( ) Yes 


Text entry rectangles at the top of the dialog box allow you to 
specify the F-to-Enter and the F-to-Remove values which control 
the entry and removal of variables into the equation. The value 4 is 
the default for F-to-Enter. The value 3.996 is the default for 
F-to-Remove. The F-to-Remove value must be less than the F-to-Enter 
value. 


Below these, radio buttons allow you to choose to force variables 
into the equation. If you select Yes, then you must specify the 
sequence of variables to be forced. These X variables enter the 
equation first and stay regardless of their F-ratios. Variables are 
specified in the following manner: to force variables X3 through X6, 
click Yes, force variables and enter 3 to 6. These numbers refer to 
the X variables' subscripts. Forced variables are entered in the order 
of their subscripts, with X3 entering first. 


Check box selections add the following information as new columns 
to the data window: 


Residuals: the difference between the observed Yi and the 
fitted Yi. 


Standardized residuals: the residuals in unit normal deviate 
form. 


Fitted values: the fitted Yi values determined by the 
regression model. 


Predicted Values: Yi values predicted for additional 
independent values. To compute predicted values, enter the 
independent values after the last complete case for the model 
(that is, after the last case where both X and Y values are 
present), leaving missing values in the dependent column. The 
program calculates predicted values using the determined 
regression equation. 
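
The first two of these columns can be sketched in Python as follows.
The sketch assumes that "unit normal deviate form" means each
residual divided by the root mean square residual; that reading, the
function name and the small example are ours and are not confirmed by
the StatView II documentation.

```python
import math


def residual_columns(observed, fitted, n_coefficients):
    """Return (residuals, standardized residuals) for a fitted model.

    Assumption: a standardized residual is the residual divided by the
    root mean square residual, which uses n - n_coefficients degrees
    of freedom.
    """
    residuals = [obs - fit for obs, fit in zip(observed, fitted)]
    df = len(observed) - n_coefficients
    rms_residual = math.sqrt(sum(e * e for e in residuals) / df)
    standardized = [e / rms_residual for e in residuals]
    return residuals, standardized


# Hypothetical observed and fitted values (intercept plus one slope).
observed = [1.0, 2.0, 3.0, 5.0]
fitted = [1.0, 2.0, 3.0, 4.0]
residuals, standardized = residual_columns(observed, fitted, n_coefficients=2)
print(residuals)
print([round(z, 3) for z in standardized])
```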


Selections made using these check boxes are added as new columns 
to the end of the dataset. If a case has been deleted from the model, 
the new column contains a missing value in that row. The rows 
containing the X-Y pairs used in the regression analysis contain 
missing values in the Predicted Values column. 


Radio buttons allow you to specify whether the above values are 
determined for just the included rows of the dataset or for all rows of 
the dataset. If you choose Included Rows, the values are calculated 
for just the included rows of the dataset; excluded rows contain 
missing values. 


If you choose All Rows, the values are calculated for all rows in the 
dataset regardless of their included or excluded state. In this manner, 
you may fit a model for one group of data and use this model to 
predict for all groups. 


You may compute residual statistics without adding them as new 
columns. StatView II automatically selects this if you choose to add 
any new columns to the data window. 


Assign a dependent Y variable, Y1, and one or more independent X 
variables. It is a ManyXOneY statistic. Only one stepwise regression 
can be calculated at a time. Two Y variables cannot be 
assigned. 


The illustrative dataset, Lipid Data, has a collection of variables that 
deal with blood and blood flow. Is there some combination of these 
variables that predict Cholesterol count? It was demonstrated in the 
correlation section that LDL is highly correlated with Cholesterol 
count, r=.96. If we were to use LDL with the other variables in a 
multiple regression analysis the analysis would not tell us too much 
about the variables and Cholesterol because LDL would predict most 
of the variation of Cholesterol. However, if we were to establish a 
multiple regression equation from the variables, excluding LDL, 
would there be some combination of these remaining variables that 
predict a statistically significant (p<.05) portion of the Cholesterol 
variation? Stepwise regression will address this question. 


*> Assign a Y to Cholesterol. 


*> Assign an X to Triglycerides, HDL, Weight, Systolic BP, and 
Diastolic BP: X1, X2, X3, X4, and X5, respectively. 


*> Choose Stepwise Regression from the Compare menu. 


*> Enter a value of 3.953 for F-to-Enter. This is the critical F-ratio, 
for 1 and 93 degrees of freedom, that must be exceeded for 
significance if alpha is set equal to .05. Any variable with a 
partial F-ratio of 3.953 or greater will be entered into the 
regression equation. 


*> Enter a value of 3.953 for F-to-Remove. Any variable with a 
partial F-ratio less than 3.953 will either be removed from the 
regression equation if it has been previously entered with a larger 
F-ratio or it will not be entered at all. 


*> Click OK. 
*> Choose Table from the View menu. 


Stepwise Regression Y1: Cholesterol 5 X variables 


Summary Information 
[Table entries are not legible in this copy.] 


No Residual Statistics Computed 


When the stepwise regression has been completed, this summary
page appears. This summary information notes the dependent
variable name and the number of independent variables from the
dataset that were used in the analysis. The F-to-enter and
F-to-remove values used in the procedure are displayed, as well as the
number of steps and the number of variables entered during the
procedure. If any variables were forced to enter the regression, their
variable subscripts would be shown. A message states that no
residual statistics were calculated. If residual statistics had been
specified, the Durbin-Watson test results would appear there.


*> Click the down arrow. 


Stepwise Regression Y1: Cholesterol 5 X variables


STEP NO. 1 VARIABLE ENTERED: X1: Triglycerides


R:      R-squared:   Adj. R-squared:   RMS Residual:
.401    .161         .152              32.852


Analysis of Variance Table
Source        DF:   Sum Squares:   Mean Square:   F-test:
REGRESSION    1     19218.259      19218.259      17.8
RESIDUAL      93    100410.646     1079.684
TOTAL         94    119628.905


The title gives the current step and the name of the variable 
entered/removed at this step. 
















These tables provide the same information as the tables shown 
earlier in this chapter under the discussion of Regression. 


The first step identified Triglycerides as the best single predictor of 
Cholesterol. In the second table, the value of R is the multiple 
correlation and represents the correlation between the observed 
dependent variable and the fitted scores predicted using 
Triglycerides as the only variable in the multiple regression equation. 
For this example the correlation between the fitted Cholesterol,
predicted from Triglycerides, and the observed Cholesterol is .401.
The proportion of Cholesterol variance that can be predicted by this 
single variable regression equation is .161, with the unbiased 
estimate of this value being .152. The root mean square residual, 
standard deviation of the residuals, is 32.852. 


The ANOVA table is interpreted in a fashion that is identical to that
discussed with simple and multiple regression. The F-test value (the
F-ratio) is substantially larger than the critical F-to-Enter, thereby
implying that the predictable variance (the regression mean
square) is greater than you would expect by chance.


*> Click the down arrow.


STEP NO. 1 Stepwise Regression Y1: Cholesterol 5 X variables


Variables in Equation
Variable:       Coefficient:   Std. Err.:   Std. Coeff.:   F to Remove:
INTERCEPT       168.413
Triglycerides   .235                                       17.8


Variables Not in Equation
Variable:       Par. Corr:   F to Enter:
HDL
Weight
Systolic BP
Diastolic BP

[Std. Err., Std. Coeff., Par. Corr, and F to Enter values are not legible in the source.]


This page contains summary information regarding the regression 
equation based upon the selected variable and the partial F-ratios for 
those variables not selected. 






If any of these variables were forced into the equation, a bullet (•)
appears by the parameter name in the table.


The above table is interpreted in a fashion similar to the
interpretation that was used for the multiple regression example.
From the above table the regression equation for predicting an initial
Cholesterol count for the ith individual, Y'i, using the Triglycerides
value, X1, may be defined as:


Y'i = 168.413 + .235 X1


The partial F-ratio of 17.8 is significant, p<.05, since it
exceeds 3.953. Note that this F-ratio is labeled F-to-Remove since it
determines whether the variable remains in the equation. The
partial F-ratios have been computed for the other independent
variables not included in the equation. Their corresponding partial
correlations are also reported. The partial correlation for an
independent variable represents the correlation between the
independent variable and the dependent variable with the effects of
the independent variable(s) included in the regression equation
partialed out. Notice that the partial F-ratio always increases with
the magnitude of the partial correlation.


The next variable to be selected for inclusion in the regression 
equation will be the variable with the largest significant, exceeding 
3.953, partial F-ratio. This will be the same variable that has the 
largest significant partial correlation. 
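
The link between the two columns can be sketched in Python. For a candidate variable with partial correlation r, and with df residual degrees of freedom remaining once it has entered, the standard relation is F = r² · df / (1 − r²). The function name, the candidate names, and the numeric values below are illustrative assumptions, not StatView II output.

```python
def partial_f(r, df_resid):
    """Partial F-ratio (1 and df_resid degrees of freedom) for a partial correlation r."""
    r2 = r * r
    return r2 * df_resid / (1.0 - r2)

F_TO_ENTER = 3.953  # threshold used in the example above

# Hypothetical partial correlations for three candidate variables.
candidates = {"A": 0.10, "B": -0.45, "C": 0.30}
df_resid = 92  # assumed: 95 cases, two predictors plus the intercept

f_values = {name: partial_f(r, df_resid) for name, r in candidates.items()}

# The variable with the largest partial F is also the one with the
# largest absolute partial correlation.
best = max(f_values, key=f_values.get)
enters = f_values[best] >= F_TO_ENTER
```

Because F depends only on the square of r, the sign of the partial correlation does not affect which variable is chosen.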


Tables like the above two tables are determined for each variable
selected for inclusion in, or removal from, the regression equation. For
our example there are only two variables included and none removed
from the regression equation. The tables that follow are the tables
associated with the last variable included in the equation. They are
structurally similar to the previous tables.


*> Click the down arrow.


Stepwise Regression Y1: Cholesterol 5 X variables
(Last Step) STEP NO. 2 VARIABLE ENTERED: X2: HDL


R:      R-squared:   Adj. R-squared:   RMS Residual:
.628    .394         .381              28.07


Analysis of Variance Table
Source        DF:   Sum Squares:   Mean Square:   F-test:
REGRESSION    2     47141.995      23570.997      29.916
RESIDUAL      92    72486.91       787.90
TOTAL         94    119628.905


This second step identified HDL as the best predictor to be used with
Triglycerides to define a two-variable multiple regression equation. In
the above table, the multiple correlation between the observed
dependent variable and the fitted scores is .628. The fitted scores
were predicted by using HDL and Triglycerides as the
independent variables in the multiple regression equation. The
proportion of Cholesterol variance that can be predicted by this new
two-variable multiple regression equation is .394, with the unbiased
estimate of this value being .381. The root mean square residual, the
standard deviation of the residuals, is 28.07.






Notice that the regression degrees of freedom are 2 in the ANOVA
table. We can no longer make a simple comparison between the
F-ratio and the F-to-Remove. The regression sum of squares is now
determined as a function of two independent variables. The
regression equation is accounting for more variance because it has
more variables in it. The predictable variance is significantly greater
than 0 (p<.05). Furthermore, the ratio of the regression sum of
squares to the total sum of squares is still R². So you might say that
the proportion of predictable variance is significantly greater than 0.
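
The relationships among the quantities on this page can be checked with a short Python sketch. It takes the regression and residual sums of squares printed in the step 2 ANOVA table and derives R², the adjusted R², the RMS residual, and the F-test. The function name is hypothetical, and the case count of 95 is an assumption inferred from the 1 and 93 degrees of freedom quoted for step 1.

```python
import math

def anova_summary(ss_regression, ss_residual, n_predictors, n_cases):
    """Derive the summary statistics of a regression from its sums of squares."""
    df_reg = n_predictors
    df_res = n_cases - n_predictors - 1
    ss_total = ss_regression + ss_residual
    ms_reg = ss_regression / df_reg
    ms_res = ss_residual / df_res
    r_squared = ss_regression / ss_total
    return {
        "r": math.sqrt(r_squared),
        "r_squared": r_squared,
        # Adjusted R-squared: the unbiased estimate mentioned in the text.
        "adj_r_squared": 1.0 - (1.0 - r_squared) * (n_cases - 1) / df_res,
        "rms_residual": math.sqrt(ms_res),
        "f_test": ms_reg / ms_res,
    }

# Sums of squares as printed for step 2; n_cases=95 is an assumption.
summary = anova_summary(47141.995, 72486.91, n_predictors=2, n_cases=95)
```

Under these assumptions the derived values agree with the printed R-squared, adjusted R-squared, RMS residual, and F-test to the precision shown.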


*> Click the down arrow.


STEP NO. 2 Stepwise Regression Y1: Cholesterol 5 X variables


Variables in Equation
Variable:       Coefficient:   Std. Err.:   Std. Coeff.:   F to Remove:
INTERCEPT       79.735
Triglycerides   .317
HDL             1.778


Variables Not in Equation
Variable:       Par. Corr:   F to Enter:
Weight
Systolic BP     -.116
Diastolic BP

[Remaining table values are not legible in the source.]


From the above table the regression equation for predicting an initial
Cholesterol count for the ith individual, Y'i, using the Triglycerides
value, X1, and HDL, X2, may be defined as:


Y'i = 79.735 + .317 X1 + 1.778 X2


Notice that the regression weight associated with Triglycerides
has changed with the inclusion of the HDL variable, as has the
intercept.


The new partial F-ratios have been computed for the other
independent variables not included in the equation, as have their
partial correlations. This partial correlation represents the correlation
between the independent variable and the dependent variable with
the effects of the two independent variables included in the
regression equation partialed out.


None of the remaining partial F-ratios exceeds 3.953. Therefore, none
of the remaining variables will be included in the regression equation
and the computations are complete.


Note that a comparison of R² values shows that Triglycerides and
HDL together do not predict Cholesterol as well as the single
variable LDL.


Factor Analysis


This menu choice performs a factor analysis of a correlation matrix.
The input can be raw data or a correlation matrix. Choose the initial
factor extraction method from the dialog box. The choices are
principal components, Kaiser image, Harris image, and iterated
principal axis analysis. Specify either no transformation for the
solution or one of equamax, varimax, and quartimax. You can also
specify the criterion used for determining the number of
factors. Factor scores can be computed and saved. Plots of the
unrotated solution, the orthogonal solution, and the oblique primary
solution are provided.


The discussion below is a brief clarification of the factor analysis 
tables produced. With minor exceptions, most of the discussion is 
contained in general form in one of the following three reference 
books: Factor Analysis by R. Gorsuch (1984), Multivariate Analysis 
With Applications in Education and Psychology by N. Timm (1975), 
and The Foundations of Factor Analysis by S. Mulaik (1972). 


*> Choose Factor Analysis from the Compare menu. The
following dialog box is displayed:


Factor Analysis Parameters

Input:  (•) Raw Data   ( ) Corr. Matrix   # cases: ____

Factor Extraction Method          Number of Factors to Extract
(•) Principal Components          (•) Method Default
( ) Harris Image Analysis         ( ) Roots greater than one
( ) Kaiser Image Analysis         ( ) 75% Variance Rule
( ) Iterated Principal Axis       ( ) Root Curve
    (•) SMC  ( ) Off-Diag  ( ) 1  ( ) User Specified ____

Transformation Method-Orthotran/
(•) Varimax   ( ) Equamax   ( ) Quartimax   ( ) No Transformation

[ ] Save Factor Scores   ( ) Oblique   ( ) Orthogonal
[ ] Save correlation matrix





Input Data 


There are two types of data that can be analyzed: raw data and 
correlation matrix data. 


Raw data is variable by case data (variables being X variable columns
and cases being rows). For such data, it is desirable that there be
more cases than variables. If you wish to compute factor
scores, the raw data input must be selected.


Correlation matrix data requires the input data (the dataset) be a 
Pearson correlation matrix. The correlations must not be determined 
from different samples of subjects; they must be determined from the 
same total pool of subjects. 


StatView II expects the correlation values to be located in the lower
left corner of the correlation matrix. Thus, you may use either a
square correlation matrix or a lower left correlation matrix as input,
as shown below. If Correlation Matrix is selected in the dialog box,
you must enter the Number of Cases used to determine the
correlation matrix. This number is used for any multivariate
significance tests performed on the data.
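
A minimal Python sketch of why either layout is sufficient: only the entries below the diagonal are needed to rebuild the full symmetric matrix. The function name and the three-variable matrix are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def full_from_lower(lower):
    """Build a symmetric correlation matrix from its lower-left triangle."""
    below = np.tril(np.asarray(lower, dtype=float), k=-1)  # strictly lower entries
    full = below + below.T                                 # mirror across the diagonal
    np.fill_diagonal(full, 1.0)                            # each variable correlates 1 with itself
    return full

# Hypothetical lower-left input; the upper-right entries are ignored.
lower = [[1.0, 0.0, 0.0],
         [0.6, 1.0, 0.0],
         [0.3, 0.5, 1.0]]
R = full_from_lower(lower)
```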


Factor Extraction Method 


The factor extraction method, also referred to as the initial factoring 
procedure, determines the magnitudes of the communality estimates 
of the variables. These influence the magnitudes of the eigenvalues 
which ultimately influence decisions regarding the number of factors 
to extract. Eigenvalues are also referred to as characteristic roots. 
There are four factor extraction methods available: principal 
components analysis; Harris image analysis; Kaiser image analysis; 
and iterated principal axis. 


Principal components analysis is probably the oldest factoring
procedure. It performs a simple eigenvalue-eigenvector analysis of
the correlation matrix in its original form. This well established
factor method is utilized in a variety of disciplines and discussed in
the applied research texts of these disciplines.
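
The eigenvalue-eigenvector analysis can be sketched with NumPy. This is a generic illustration under stated assumptions, not StatView II's own routine: the correlation matrix is hypothetical and the function name is invented for the example. Loadings are the eigenvectors scaled by the square roots of their eigenvalues.

```python
import numpy as np

def principal_components(R, n_factors):
    """Return eigenvalues (descending) and unrotated loadings for the first n_factors."""
    eigvals, eigvecs = np.linalg.eigh(R)           # symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]              # re-sort largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])
    return eigvals, loadings

# Hypothetical three-variable correlation matrix.
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
eigvals, loadings = principal_components(R, n_factors=3)
```

When all components are kept, the loadings reproduce the correlation matrix exactly, and the eigenvalues sum to the number of variables.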


Harris image analysis, a more recent factoring procedure, requires a 
nonsingular correlation matrix. It has a tendency to extract more 
factors than either of the other two non-image analysis factor 
extraction methods. It is properly referred to as a psychometric factor 
procedure, meaning that the model is based on a theory of variable 
sampling as opposed to the more traditional theory of statistical or 
subject sampling. This method factors a modification of the original 
correlation matrix, the image variance covariance matrix. Because of 
the large number of factors that define an image factor solution, the 
final rotated solution usually has a large number of zero loadings. 
However, the non-zero loadings are not always as large in magnitude 
as those large loadings observed in the more traditional factor 
analytic model. Furthermore, the Harris image analysis will 
determine factors that are defined by a single variable, pseudo- 
specific factors. 


Kaiser image analysis is a factor model that was defined by Kaiser in 
a manuscript devoted to a clarification of the Harris image analysis. 
It is distinct from the Harris image analysis in that it rescales the 
factor solution so it represents a factor analysis of the original 
correlation matrix and allows the user to impose the interpretive 
model of traditional factor analysis. Both image analyses define the 
same number of factors. While this is a trivial algebraic 
modification, the final, transformed Kaiser image analysis solution 
will have loadings that are quite different from those loadings of the 
Harris image analysis solution. 


The iterated principal axis is an iterative factor extraction method.
This method requires that some estimate of the communalities be
placed in the diagonal of the correlation matrix prior to the initial
factoring. The communality estimates are modified with each
iteration. The factoring continues until the communality estimates
stabilize. This factoring method requires significantly more
computation time than the other factoring methods. If you select this
method of factor extraction, you are also required to select a method
for determining the initial communality estimates. In the dialog box,
choosing SMC causes the squared multiple correlations to be the
initial estimates. Choosing Off-Diag causes the largest off-diagonal
entries of the correlation matrix to be the initial estimates. Choosing
1 causes 1 to be the initial estimate. The distinction between these
initial communality estimates is quite easy to follow and is discussed
in a number of factor analytic texts, including Gorsuch. If you choose
SMC and your correlation matrix is singular, the largest off-diagonal
entries of the correlation matrix are used as the initial communality
estimates.
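
The iteration described above can be sketched as follows. This is a textbook-style illustration with SMC starting values, not the program's actual code; the function names, the stopping tolerance, and the one-factor test matrix are all assumptions.

```python
import numpy as np

def smc(R):
    """Squared multiple correlation of each variable with all the others."""
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))

def iterated_principal_axis(R, n_factors, max_iter=500, tol=1e-10):
    h2 = smc(R)                                    # initial communality estimates
    for _ in range(max_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)              # communalities replace the 1s
        eigvals, eigvecs = np.linalg.eigh(reduced)
        top = np.argsort(eigvals)[::-1][:n_factors]
        lam = np.clip(eigvals[top], 0.0, None)     # guard against small negative roots
        loadings = eigvecs[:, top] * np.sqrt(lam)
        new_h2 = np.sum(loadings**2, axis=1)       # row sums of squared loadings
        converged = np.max(np.abs(new_h2 - h2)) < tol
        h2 = new_h2
        if converged:                              # estimates have stabilized
            break
    return loadings, h2

# Hypothetical one-factor correlation matrix built from known loadings.
true = np.array([0.8, 0.7, 0.6, 0.5])
R = np.outer(true, true)
np.fill_diagonal(R, 1.0)
loadings, h2 = iterated_principal_axis(R, n_factors=1)
```

On this constructed matrix the communalities settle at the squares of the known loadings, which illustrates what "stabilize" means in practice.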


Number of Factors to Extract 


The initial factoring method determines the properties of the initial
factors. The number of factors retained for interpretation is almost
always a function of the eigenvalues (sometimes referred to as latent
roots or just roots). There are three general criteria that are used for
determining the number of factors: roots greater than 1, root curve
analysis, and extraction of 75% of the variance.


The criterion of Roots greater than 1 specifies that as many factors
will be retained as there are eigenvalues greater than or equal to 1.


The criterion of 75% variance rule is determined by the sum of all 
eigenvalues, which is also the matrix variance. Keeping in mind that 
the eigenvalues are determined in order of descending magnitude, it 
becomes clear that each eigenvalue accounts for successively less 
variance than the eigenvalue preceding it. As soon as the sum of the 
proportionate contributions of the eigenvalues exceeds .75, it is 
assumed that all relevant matrix variance has been accounted for. 
The rank order of the eigenvalue that pushes the sum of the 
proportionate contributions over .75 is assumed to be an index of the 
total number of factors to be retained for further analysis. 
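
The two criteria just described can be written as small Python functions. The eigenvalues below are hypothetical and the function names are invented for the sketch; the root curve criterion is not included because it depends on how the point of inflection is located.

```python
import numpy as np

def roots_greater_than_one(eigvals):
    """Number of eigenvalues greater than or equal to 1."""
    return int(np.sum(np.asarray(eigvals, dtype=float) >= 1.0))

def variance_rule(eigvals, threshold=0.75):
    """Rank of the eigenvalue that pushes the cumulative proportion over threshold."""
    ev = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending magnitude
    cumulative = np.cumsum(ev) / ev.sum()                   # proportionate contributions
    return int(np.argmax(cumulative > threshold)) + 1

eigvals = [5.0, 1.2, 1.05, 0.4, 0.2, 0.15]   # hypothetical roots for eight variables
n_roots = roots_greater_than_one(eigvals)
n_var75 = variance_rule(eigvals)
```

With these values the two rules disagree (three factors versus two), which is why the choice of criterion matters.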


The root curve criterion is based upon the work of Cattell (1966) and 
Cattell and Jaspers (1967). Essentially, this criterion determines the 
eigenvalue associated with the point of inflection of a plot of the 
eigenvalues from largest to smallest. Inasmuch as the eigenvalues are 
always determined in order from largest to smallest, the rank order of 
the eigenvalue associated with the point of inflection is considered to 
be an estimate of the number of factors. It is frequently employed 
with interactive factor analyses. StatView II will plot this graph. 


Method default represents a criterion that is unique to the factoring
method employed, and may or may not determine the same number
of factors as one of the first three criteria. The default method for
principal components is a combination of two criteria. It is the larger
of the numbers determined by either the 75% variance rule or by the
root curve analysis.


The default method for the two image analysis models follows Harris 
(1962), and defines the number of factors by a mathematically 
precise method that works only for the image analysis model: Harris 
eigenvalues greater than 1. Harris eigenvalues are the eigenvalues of 
the image variance-covariance matrix. If you decide to apply one of 
the three criteria discussed above in place of the method default, the 
criterion selected is applied to a modification of the Harris
eigenvalues. If you attempt to enter a specified number of factors,
and this specified number of factors is greater than the number
determined by the image analysis method default, then the
specified number is overridden by the number determined by the
method default.


The iterated principal axis method default is simply the number of 
eigenvalues greater than 1. 


You can specify the number of factors to extract. If you have decided
to specify the number of factors prior to the analysis, you should be
aware of several salient points. Traditionally, the maximum number of
factors that you might expect in a factor analysis is half the number
of variables being analyzed. Under no circumstances should the
estimated number be larger than the number of variables. If, for
some reason, you overestimate the number of factors, the estimate
is adjusted to the maximum possible number for the given matrix.
This may be less than the number of variables, especially if the
number of subjects defining the correlations is less than the number
of variables.


Transformation Method


The factor method and the number of factors represent the first part
of factor analysis. The transformation method represents the second
part of factor analysis. If you decide that you want no
transformation, then the initial factor solution is considered to be the
final solution matrix.


If you want to define the final solution by some transformation
solution, you have the option of three types of orthogonal solutions:
varimax, equamax, and quartimax. Regardless of your choice of
orthogonal solution, an oblique solution, the orthotran solution, is
computed from your orthogonal solution. All three orthogonal
solutions attempt to define a simple structure solution with the
constraint that the factors remain orthogonal, or uncorrelated. The
orthotran solution relaxes the constraint of orthogonality, tolerating
correlated factors, and determines a clearer simple structure than the
orthogonal solutions.


All three orthogonal solutions are computed by maximization of the
orthomax criterion. StatView II computes normalized solutions. If a
quartimax solution is selected, a majority of the variance is allocated
to the first factor with the consequence that the other factors do not
have a good simple structure. When an equamax solution is selected,
the variance is allocated equally to all factors. When a varimax
solution is selected, a solution between the equamax solution and
quartimax solution is obtained. Typically, the varimax criterion is
maximized unless you have a compelling reason for maximizing
either of the other two criteria.
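
A minimal sketch of an orthomax rotation, assuming the widely used SVD-based iteration rather than whatever algorithm StatView II implements. In the usual parameterization the weight gamma is 0 for quartimax, 1 for varimax, and half the number of factors for equamax. The function name, the defaults, and the loading matrix are hypothetical; the normalize option applies row (Kaiser) normalization before rotating.

```python
import numpy as np

def orthomax(A, gamma=1.0, normalize=True, max_iter=200, tol=1e-10):
    """Rotate loadings A (variables x factors) toward the orthomax criterion.

    Returns the rotated loadings and the orthogonal transformation matrix.
    """
    A = np.asarray(A, dtype=float)
    p, k = A.shape
    scale = np.sqrt(np.sum(A**2, axis=1)) if normalize else np.ones(p)
    B = A / scale[:, None]                        # rows scaled to unit length
    T = np.eye(k)                                 # accumulated transformation
    previous = 0.0
    for _ in range(max_iter):
        L = B @ T
        G = B.T @ (L**3 - (gamma / p) * L * np.sum(L**2, axis=0))
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                                # nearest orthogonal matrix to G
        if previous > 0 and s.sum() < previous * (1 + tol):
            break
        previous = s.sum()
    return (B @ T) * scale[:, None], T

# Hypothetical unrotated loadings for six variables on two factors.
A = np.array([[0.7, 0.5], [0.8, 0.4], [0.6, 0.6],
              [0.5, -0.6], [0.4, -0.7], [0.6, -0.5]])
varimax_loadings, T = orthomax(A, gamma=1.0)
quartimax_loadings, _ = orthomax(A, gamma=0.0)
equamax_loadings, _ = orthomax(A, gamma=A.shape[1] / 2)
```

Because the transformation is orthogonal, each variable's communality (the row sum of squared loadings) is unchanged by the rotation; only the allocation of variance among factors changes.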


The orthotran solution is a general transformation solution that
refines the simple structure of an orthogonal solution. It refines the
simple structure by allowing the factors to be correlated, an oblique 
solution. If correlated factors do not refine the simple structure, then 
the solution matrix remains an orthogonal solution. If a better simple 
structure is obtained from an oblique solution, then the solution 
matrices are the primary pattern and the primary intercorrelation 
matrix or the reference structure solution and the primary 
intercorrelations. Whenever an orthogonal solution is selected, 
StatView II determines the associated orthotran solution as well. 


Save Factor Scores 


The third part of factor analysis deals with the factor scores. If you 
have a non-singular correlation matrix, it is possible to compute 
regression estimate factor score weights. Checking Save Factor 
Scores causes StatView II to compute and save the factor score 
weights. The factor scores are added as new columns to the end of 
the dataset. This option is available only if raw data has been input. 
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If you did not determine a transformation solution, you obtain 
unrotated factor scores. If you computed a transformation solution, 
select whether you wish to have orthogonal or oblique factor scores. 
If you select orthogonal, the factor scores show 0 intercorrelations. If 
you select oblique, the factor scores are correlated. More precisely, 
the intercorrelation of the primary factors represents the 
intercorrelations you would obtain if you were to actually compute 
the intercorrelations of the factor scores. 
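
The regression estimate of factor scores can be sketched as follows, assuming unrotated principal component loadings and randomly generated raw data (both hypothetical). The weights are obtained by solving R W = S, where S holds the structure values, and the scores are the standard scores multiplied by W.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # hypothetical raw data

Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)      # standard scores
R = np.corrcoef(raw, rowvar=False)                          # correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)
top = np.argsort(eigvals)[::-1][:2]                         # two largest roots
loadings = eigvecs[:, top] * np.sqrt(eigvals[top])          # unrotated structure values

weights = np.linalg.solve(R, loadings)                      # factor score weights
scores = Z @ weights                                        # one column per factor
score_corr = np.corrcoef(scores, rowvar=False)
```

For this orthogonal, unrotated case the intercorrelation of the two score columns is 0, consistent with the description above.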


Save Correlation Matrix 
If you check Save correlation matrix, the computed correlation
matrix is saved to a new dataset window. It is saved as a square
correlation matrix.


Tabular Views

Assign an X variable to the data to be analyzed. If you are inputting
raw data, assign an X to each variable to be used in determining the
correlation matrix. If you are inputting a correlation matrix, assign an
X to each column of the correlation matrix. The number of X variables
that can be assigned is limited to 80.


*> Assign an X to each column of the eight physical variables
correlation matrix.


*> Choose Factor Analysis from the Compare menu. We are
inputting a correlation matrix.


*> Click Correlation matrix, and enter 305 as the number of cases.
*> Click Principal Components as the factor extraction method.


*> Click Method Default to determine the number of factors to
extract.


*> Click Varimax as the transformation method.

*> Click OK.

*> Choose Table from the View menu.

*> Page through the view as you follow the explanation below.


The Factor Analysis displays the progress of the calculation. This is
useful to track the progress of longer calculations.


Factor Analysis for physical variables: X1 ... X8


Summary Information 


When the factor analysis has been completed, this summary page
appears. This page notes the dataset name and the number of
variables from the dataset that were used in the analysis. The
factor procedure is noted along with the procedure used to determine
the number of factors, the transformation procedure, and the actual
number of factors defined. If factor scores were computed and saved,
the columns they were saved in are noted beneath the summary
table. Also, if the correlation matrix is either singular or
ill-conditioned, this is noted underneath the table.






Correlation matrix


[Lower-triangular table of correlations among height, arm span, forearm..., lower leg..., weight, bitrochan..., chest girth, and chest wid.... Most entries are not legible in the source. The final row, chest wid..., reads .382 .415 .345 .365 .629 .577 .539.]


The correlation matrix table is fundamental to factor analysis. The
correlation matrix is the variance-covariance matrix of the variables
in a standard score format. The ijth entry of the correlation matrix
(row i and column j) is the correlation between variable i and
variable j. The correlation matrix is printed in triangular form
because half of the correlation coefficients of a correlation matrix
duplicate the other half. The
correlation between variable i and variable j is exactly the same as
the correlation between variable j and variable i. Virtually all books
on factor analysis and multivariate analysis discuss the correlation
matrix.


Partials in off-diagonals and Squared Multiple R in diagonal


[Lower-triangular table for height, arm span, forearm..., lower leg..., weight, bitrochan..., chest girth, and chest wid..., with partial correlations off the diagonal and squared multiple correlations on the diagonal. Most entries are not legible in the source.]


Guttman (1954) provided ample algebraic evidence that for a
composite of variables that are going to be factor analyzed, the
partial correlations between the variables should approach 0.
Furthermore, he argued, the multiple correlations for the variables
should be reasonably high. The partial correlation between any two
variables, say height and lower leg in our example, is the correlation
(.479) that exists between the two variables with the effects of the
other variables in the matrix partialed out. That is, in the eight
physical variables, the partial correlation between height and lower
leg is an estimate of the correlation between the two variables based
upon variation that is common to the two variables, but not common
with any other variables in the matrix.


The square of this partial correlation (.229) represents the proportion
of variance of either variable that could be predicted in a linear
regression sense only by the other variable and not by any other
variable in the matrix. Alternatively, the squared multiple correlation
for a variable represents the proportion of variance for that variable
that is common with all other variables in the matrix. The squared
multiple correlation for height suggests that approximately 82
percent of the variation in height may be predicted in a linear
regression sense from the other seven variables.
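
Both quantities in this table can be obtained from the inverse of the correlation matrix, using standard identities: the partial correlation between variables i and j is the negative of the (i, j) element of the inverse divided by the square root of the product of the two diagonal elements, and the squared multiple correlation is 1 minus the reciprocal of the diagonal element. The sketch below uses a hypothetical three-variable matrix and an invented function name.

```python
import numpy as np

def partials_and_smc(R):
    """Partial correlations off the diagonal, squared multiple R on the diagonal."""
    P = np.linalg.inv(R)
    d = np.sqrt(np.diag(P))
    table = -P / np.outer(d, d)               # partial correlations
    np.fill_diagonal(table, 1.0 - 1.0 / np.diag(P))   # squared multiple correlations
    return table

# Hypothetical three-variable correlation matrix.
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
table = partials_and_smc(R)
```

With only three variables, the off-diagonal entries reduce to the familiar first-order partial correlation formula, which makes the result easy to check by hand.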


Measures of Variable Sampling Adequacy 
Total matrix sampling adequacy : .845 


height 

arm span 
forearm length 
lower leg len... 
weight 
bitrochanteri... 
chest girth 
chest width 





Bartlett Test of Sphericity- DF: 35 Chi Square: 2116.975 P: .0001 


Addressing Guttman's (1954) expectation of 0 partial correlations 
and large multiple correlations, Kaiser (1970) developed an index, 
called variable sampling adequacy, of the extent to which a matrix 
of partial and multiple correlations conforms to 0 partials and large 
multiple correlations. The notion of variable sampling adequacy 
follows logically from Guttman's discussion of partial and multiple 
correlations. Specifically, to the extent that a composite of variables 
is logically homogeneous (measuring the same universe of content) 
they are especially appropriate for factor analysis. The measure of 
variable sampling adequacy reported in StatView II is derived from 
Kaiser's (1970) equations. This index quantifies the extent to which a 
composite of variables, and the variables within the composite, 
conform to the desired expectation of the partial correlations tending 
toward 0. 


Kaiser argues that the sampling efficiency for the total composite of
variables, total matrix sampling adequacy (or MSA), should be
greater than .500 in order to assume that Guttman's assumptions have
been minimally met. For the eight physical variables, the index is
.845, which suggests that these data do indeed represent a
homogeneous collection of variables and are suitable for factor
analysis. As the index approaches 1, you may assume that the data
are conforming almost perfectly to the assumption of 0 partial
correlations.


It is quite possible that one or more variables are different from the
other variables in the composite and from each other. In such a case,
the total MSA is depressed, perhaps even appearing to be less than
.50. Such variables each have a low index of variable MSA because
they may not logically belong to the same psychometric universe of
content as the other variables in the composite of variables. They
will have an unpredictable influence on any factor analysis done on
the composite. Eliminating those variables with low indices of
sampling efficiency will result in an improved index of total MSA,
and a composite of variables that are perhaps more appropriate for
factor analysis. For our example, the lowest index of variable MSA
is .78 for weight. Even this lowest value is substantially larger than
the minimum value of .50, thereby reaffirming that these variables
are appropriate for a factor analysis.
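
As an illustration of how such an index behaves, the sketch below computes the widely used Kaiser-Meyer-Olkin form of the measure: the sum of squared correlations divided by the sum of squared correlations plus the sum of squared partial correlations, taken over the off-diagonal entries. Whether StatView II uses exactly this form of Kaiser's equations is an assumption; the function name and the test matrix are hypothetical.

```python
import numpy as np

def sampling_adequacy(R):
    """Total and per-variable measures of sampling adequacy (KMO form)."""
    P = np.linalg.inv(R)
    d = np.sqrt(np.diag(P))
    Q = -P / np.outer(d, d)                        # partial correlations
    off = ~np.eye(R.shape[0], dtype=bool)          # mask for off-diagonal entries
    r2 = np.where(off, R**2, 0.0)
    q2 = np.where(off, Q**2, 0.0)
    per_variable = r2.sum(axis=0) / (r2.sum(axis=0) + q2.sum(axis=0))
    total = r2.sum() / (r2.sum() + q2.sum())
    return total, per_variable

# Hypothetical matrix in which every pair of variables correlates .5.
R = np.full((3, 3), 0.5)
np.fill_diagonal(R, 1.0)
total_msa, variable_msa = sampling_adequacy(R)
```

The index rises toward 1 as the partial correlations shrink relative to the ordinary correlations, which is the behavior the text describes.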


The measure of sampling adequacy is one of two evaluations that
should be performed prior to any attempts to interpret the results of a
factor analysis. The second evaluation is associated with the
statistical significance of the correlations. It has been demonstrated
that interpretable factors may emerge from data that are totally
random. Bartlett's test of sphericity is the multivariate analog of the
statistical test that is frequently applied to a single correlation
coefficient to see if it is significantly different from 0. The test of
sphericity is used to determine if, in general, the collection of
correlations in the correlation matrix are different from 0. Ideally, a
significant chi-square value is determined, thereby suggesting that
the collection of correlations are different from 0 and most likely do
not occur as a function of chance. For our illustrative data, the
chi-square value is 2116, which is significant at the .0001 level. These
values are reported at the bottom of the summary of the MSA values.
Thus, our correlations are, in general, significantly different from 0
correlations.
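
A sketch of the chi-square statistic in its common textbook form, -(n - 1 - (2p + 5)/6) ln|R|, where n is the number of cases and p the number of variables. The degrees of freedom returned here, p(p - 1)/2, are the usual textbook convention and an assumption: the printed table reports DF: 35 for eight variables, which does not equal 28, so the program evidently counts them differently. The function name and the test matrix are hypothetical.

```python
import numpy as np

def bartlett_sphericity(R, n_cases):
    """Chi-square statistic and textbook degrees of freedom for Bartlett's test."""
    p = R.shape[0]
    _, logdet = np.linalg.slogdet(R)               # log of the determinant of R
    chi_square = -(n_cases - 1 - (2 * p + 5) / 6.0) * logdet
    df = p * (p - 1) // 2                          # assumed convention
    return chi_square, df

# Hypothetical three-variable correlation matrix with 305 cases.
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
chi_square, df = bartlett_sphericity(R, n_cases=305)

# With no correlation at all, the determinant is 1 and the statistic is 0.
chi_null, _ = bartlett_sphericity(np.eye(8), n_cases=305)
```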


Eigenvalues and Proportion of Original Variance


          Magnitude   Variance Prop.
Value 1
Value 2
Value 3
Value 4





StatView II uses a powerful algorithm to determine eigenvalues: the
HOW method. This method is based on the works of Householder,
Ortega and Wilkinson. The method is discussed in great detail by
Wilkinson (1965). Many properties of eigenvalues are beyond the
scope of this discussion. Virtually all subjective criteria used to
determine the number of factors can be applied to the table above.
When looking at this table of eigenvalues, it may be noted that the
eigenvalues are presented in an order that corresponds to their size.
Typically, there are as many eigenvalues as there are variables, and
the sum of the eigenvalues is equal to the sum of the diagonal
elements of the matrix from which they are determined. The variance
proportion is an estimate of the proportion of variance that the
eigenvalue and its associated eigenvector account for when they are
used to define a factor.


Usually, StatView II divides the number of variables by two to
determine an initial estimate of the number of eigenvalues (which is
also an initial estimate of the number of factors). The many rules for
determining the number of final factors are then applied to the
eigenvalues that have been determined (see dialog box discussion on
number of factors above). You may override the number of
eigenvalues determined initially by equating the number of factors in
the dialog box to the number of desired eigenvalues. The eigenvalues
as displayed by StatView II are of no great value in the final
interpretation of the factor solution. They are displayed for purposes
of completeness and for those who wish to address subjectively the
number-of-factors question.


Eigenvectors


[Table of Vector 1 through Vector 4 for height, arm span, forearm leng..., lower leg len..., weight, bitrochanter..., chest girth, and chest width. Most values are not legible in the source.]


Like the eigenvalues, the eigenvectors are included in the display for
purposes of completeness. The eigenvectors are computed with the
eigenvalues. For every eigenvalue, there is an associated
eigenvector. The eigenvectors are not used when interpreting the
final factor solution.





Unrotated Factor Matrix 


[Table: Unrotated Factor Matrix. Columns: Factor 1, Factor 2. Rows: height, arm span, forearm leng..., lower leg len..., weight, bitrochanter..., chest girth, chest width.]


Once the number of factors has been determined, it is necessary to 
determine the correlation of each variable with each factor, a 
structure value typically referred to as a loading. Most modern-day 
factor analysts view this initial, unrotated factor matrix as the first 
step in determining a desirable factor solution matrix. The square of 
a structure value represents the proportion of variance of the variable 
associated with the row that can be predicted by the factor associated 
with the column. 


Computing the sum of the squared structure values by row results in 
a proportion, the final communality estimate, that represents the total 
proportion of variance of the variable that can be predicted by the 
factors. 
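
The row-wise computation can be sketched as follows. The loadings and variable names are hypothetical and are not the values from the eight physical variables example.

```python
# Hypothetical unrotated loadings: one row per variable, one column per factor.
loadings = {
    "var_a": [0.80, 0.30],
    "var_b": [0.70, -0.40],
    "var_c": [0.60, 0.50],
}

def communality(row):
    """Sum of squared structure values across the factors for one variable."""
    return sum(value ** 2 for value in row)

communalities = {name: communality(row) for name, row in loadings.items()}

for name, h2 in communalities.items():
    print(f"{name}: final communality estimate = {h2:.3f}")
```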


Communality Summary 


[Table: Communality Summary. Columns: SMC, Final Estimate. Rows: height, arm span, forearm leng..., lower leg len..., weight, bitrochanter..., chest girth, chest width.]


Prior to a factor analysis, the total proportion of variance of a 
variable that is predictable is estimated by the squared multiple 
correlation of the variable. Both the communality estimates and the 
squared multiple correlations (SMC) are reported in the Communality 
Summary table. Some analysts prefer to think of the squared multiple 
correlations as the initial communality estimates, while others prefer 
to think of the largest off-diagonal entry associated with the variable 
as the initial communality estimate. In the situation where a singular 
(determinant equal to 0) correlation matrix is analyzed, the initial 
communality estimate is assumed to be 0. 


For the eight physical variables, it can be seen from the Communality 
Summary table that approximately 82 percent of the variation in 
height is predictable in a linear regression equation using the other 
seven variables. This conclusion is derived from the squared multiple 
correlation of height. When two factors are used to predict height, 
approximately 88 percent of the variation is predictable, an improvement 
of approximately 6 percentage points. 
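
As a rough sketch of how a squared multiple correlation can be obtained from a correlation matrix, the standard identity SMC = 1 - 1/(i-th diagonal element of the inverse matrix) is used below. The matrix is hypothetical, the helper names are invented for illustration, and the routine is not StatView II's own. A singular matrix raises an error here rather than yielding an estimate of 0.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def squared_multiple_correlations(corr):
    """SMC of each variable with all the others: 1 - 1 / diag(R^-1)."""
    inverse = invert(corr)
    return [1.0 - 1.0 / inverse[i][i] for i in range(len(corr))]

# Hypothetical correlation matrix for three variables.
corr = [
    [1.0, 0.5, 0.4],
    [0.5, 1.0, 0.3],
    [0.4, 0.3, 1.0],
]
smc = squared_multiple_correlations(corr)
print([round(value, 3) for value in smc])
```

With only two variables the SMC reduces to the squared correlation between them, which is a convenient check on the routine.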


On occasion, StatView II reports a final communality estimate that is 
slightly larger than 1. When this occurs, the associated variable is 
referred to as a Heywood case. Generally, the associated factor 
analysis will not be flawed. For further information on the Heywood 
case, the interested user is referred to the Gorsuch book. 


Orthogonal Transformation Solution-Varimax 


[Table: Orthogonal Transformation Solution-Varimax. Columns: Factor 1, Factor 2. Rows: height, arm span, forearm leng..., lower leg len..., weight, bitrochanter..., chest girth, chest width.]


When interpreting a factor solution (attempting to name the factors), 
it is substantially easier to deal with solutions where the variables 
have high loadings on just one factor or 0 loadings on most factors (a 
simple structure). Simple structure is best achieved by allowing the 
final factors to be correlated with each other: an oblique solution. 
Some analysts prefer solutions in which the factors are uncorrelated: 
orthogonal solutions. StatView II uses a general orthogonal 
transformation procedure to compute an orthogonal solution. The 
general solution algorithm provides a choice of either the varimax 
orthogonal solution, the equamax orthogonal solution, or the 
quartimax orthogonal solution. With only a few exceptions, the 
contributions of the factors to a factor solution: 


• are evenly distributed across all factors with an equamax 
solution 


• tend to be concentrated on just a few factors with the 
quartimax solution 


• are neither of the two extremes just noted when defined as a 
varimax solution 


For the eight physical variables, the StatView II default, a varimax 
transformation solution, was determined as the orthogonal solution. 
Notice that for this solution, the first four variables show large 
positive loadings on (correlations with) the first factor, while the last 
four variables show large positive loadings on the second factor. The 
correlations of the first four variables with the second factor are quite 
low. Similarly, the correlations of the second four variables with the 
first factor are quite low. However, many of these low correlations 
are nonetheless statistically significant. An oblique solution, with 
correlated factors, will reduce the magnitude of these low loadings. 
Small but statistically significant loadings in an orthogonal factor 
solution suggest that the solution might be better described as 
an oblique solution. 
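
To make the idea of an orthogonal transformation concrete, the sketch below rotates a two-factor loading matrix to the angle that maximizes the raw varimax criterion (the summed variance of the squared loadings in each column). It is an illustration under stated assumptions: the loadings are hypothetical, no row normalization is applied, and StatView II's general transformation procedure is not reproduced here.

```python
import math

def varimax_criterion(loadings):
    """Raw varimax criterion: sum over factors of the variance of squared loadings."""
    p = len(loadings)
    total = 0.0
    for k in range(len(loadings[0])):
        squares = [row[k] ** 2 for row in loadings]
        mean = sum(squares) / p
        total += sum((s - mean) ** 2 for s in squares) / p
    return total

def varimax_angle(loadings):
    """Rotation angle that maximizes the raw varimax criterion for two factors."""
    p = len(loadings)
    u = [x * x - y * y for x, y in loadings]
    v = [2.0 * x * y for x, y in loadings]
    a, b = sum(u), sum(v)
    c = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
    d = 2.0 * sum(ui * vi for ui, vi in zip(u, v))
    return 0.25 * math.atan2(d - 2.0 * a * b / p, c - (a * a - b * b) / p)

def rotate(loadings, angle):
    """Rigid (orthogonal) rotation of a two-column loading matrix."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [[x * cos_a + y * sin_a, -x * sin_a + y * cos_a] for x, y in loadings]

# Hypothetical unrotated loadings for four variables on two factors.
unrotated = [[0.8, 0.4], [0.7, 0.5], [0.6, -0.5], [0.8, -0.3]]
rotated = rotate(unrotated, varimax_angle(unrotated))

for before, after in zip(unrotated, rotated):
    print(before, "->", [round(value, 3) for value in after])
```

Because the rotation is rigid, each variable's communality (row sum of squared loadings) is unchanged; only the distribution of the loadings across the factors moves toward simple structure.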




Oblique Solution Primary Pattern Matrix-Orthotran/Varimax 


[Table: Oblique Solution Primary Pattern Matrix. Columns: Factor 1, Factor 2. Rows: height, arm span, forearm leng..., lower leg len..., weight, bitrochanter..., chest girth, chest width.]





When determining an oblique solution, StatView II uses an 
algorithm that simply takes a given orthogonal solution and releases 
the restriction of orthogonality. The algorithm, the orthotran 
solution, always defines a simple structure solution that is as good as or 
better than the associated orthogonal simple structure solution. 


There are two types of oblique solution in factor analysis: a primary 
pattern solution and a reference structure solution. StatView II 
determines both types of solution. These solutions are quite similar; 
indeed, one is a column rescaling of the other. The pattern solution 
defines loadings that are regression coefficients for predicting the 
standard score form of a variable in terms of the defined factors. The 
reference structure solution defines loadings that are correlations. 
Both solutions have good simple structure in the sense that the high 
loadings are high, and the low loadings are near 0. 
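
The column rescaling can be illustrated for the two-factor case. The sketch assumes the usual textbook relation in which, with two factors correlated r, each reference structure column equals the matching pattern column multiplied by sqrt(1 - r^2). The factor correlation of .503 is the value reported for this example; the pattern loadings are hypothetical, and whether StatView II computes the rescaling in exactly this way is an assumption.

```python
import math

# Correlation between the two primary factors reported for this example.
factor_correlation = 0.503

# For two factors, each reference structure column is the matching pattern
# column multiplied by sqrt(1 - r^2) (assumed textbook relation).
scale = math.sqrt(1.0 - factor_correlation ** 2)

# Hypothetical primary pattern loadings (regression weights) on two factors.
pattern = [[0.95, -0.02], [0.03, 0.90]]
reference_structure = [[value * scale for value in row] for row in pattern]

print(f"column scale = {scale:.3f}")
for row in reference_structure:
    print([round(value, 3) for value in row])
```

Since the scale factor is below 1 whenever the factors are correlated, the large loadings come out smaller in the reference structure than in the primary pattern.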


Oblique Solution Reference Structure-Orthotran/Varimax 


[Table: Oblique Solution Reference Structure. Columns: Factor 1, Factor 2. Rows: height, arm span, forearm leng..., lower leg len..., weight, bitrochanter..., chest girth, chest width.]


When comparing a primary pattern solution to a reference structure 
solution, it is immediately apparent that the large loadings are larger 
in the primary pattern solution. Sometimes these primary pattern 
values become larger than 1, simply because they are regression 
weights. Regardless of whether you use a primary pattern or 
reference structure solution, the conclusions should be the same. For 
the eight physical variables, it is clear that the first four variables are 
associated with the first factor and not at all associated with the 
second factor. Using similar logic, it is apparent that the second four 
variables are associated with the second factor. If we were to attempt 
to name the factors, we would attempt to name the first factor so that 
it represented the essence of the variables loading on it. The first 
factor could be named bone structure. The last four variables all deal 
with flesh, and the second factor could be named the flesh factor. 


Clearly, for these data we would have arrived at the same factor 
names if we had attempted to interpret the factors from an orthogonal 
solution. Is it reasonable to assume that body weight or flesh is 
independent of bone structure? If your response is "yes," then you 
might be satisfied with an orthogonal solution. If, however, you 
assumed that taller people are in general heavier and fleshier than 
shorter people, then you would be satisfied with an oblique solution. 


Primary Intercorrelations-Orthotran/Varimax 


[Table: Primary Intercorrelations. Rows and columns: Factor 1, Factor 2.]


When utilizing an oblique solution, you are obligated to define the 
intercorrelations of the factors. Regardless of whether you are using 
a primary pattern or reference structure, it is the primary factor 
intercorrelations that are reported. For the eight physical variables, 
there is only one correlation, the correlation between the flesh factor 
and the bone factor. According to the StatView II solution, the 
correlation between these two factors for this particular dataset is 
.503. Typically, you do not define oblique solutions with factor 
intercorrelations much larger than .50. If you were to have large 
factor intercorrelations, say .70 or higher, this would suggest that the 
factor solution may have been under-factored: that is, more factors 
should have been extracted in the initial solution. 


Variable Complexity-Orthotran/Varimax 


                   Orthogonal   Oblique
height             1.166
arm span           1.088        1.005
forearm leng...
lower leg len...
weight
bitrochanter...    1.092
chest girth        1.032
chest width        1.221        1.006

Average





Variable complexity refers to the factor density of a variable. Usually 
for ideal simple structure, which you never have with an unrotated 
initial solution, each variable is accounted for by no more than one 
factor. When you have this ideal simple structure, the average 
variable complexity will be 1. That is, on the average, each variable 
is defined by no more than one factor. 


To the extent that simple structure is not achieved, each variable is 
defined by more than one factor, and the average variable 
complexity will be greater than unity. The average variable 
complexity of an oblique solution will never be greater than the 
average variable complexity of the associated orthogonal solution. 


For the eight physical variables it is apparent that the average 
variable complexity for both orthogonal and oblique solutions is low. 
The variables height, lower leg length, weight and chest width are a 
bit more factorially dense than the other variables in the orthogonal 
solution. Notice how the oblique solution has reduced the 
complexity of these variables. This reduction in complexity may be 
verified by comparing the orthogonal solution with the oblique 
reference structure solution. The oblique solution is a near perfect 
simple structure solution. 
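
One common way to quantify variable complexity is the index (sum of squared loadings)^2 / (sum of fourth powers of the loadings), which equals 1 when a variable loads on a single factor and rises toward the number of factors as the loadings even out. The sketch below uses that index as an assumption, since the exact formula StatView II applies is not stated here, and the loadings are hypothetical.

```python
def variable_complexity(row):
    """Complexity index: (sum of squared loadings)^2 / sum of fourth powers."""
    squares = [value ** 2 for value in row]
    return sum(squares) ** 2 / sum(s ** 2 for s in squares)

# Hypothetical loadings for three variables on two factors.
loadings = [[0.9, 0.0], [0.9, 0.2], [0.6, 0.6]]
complexities = [variable_complexity(row) for row in loadings]
average_complexity = sum(complexities) / len(complexities)

for row, value in zip(loadings, complexities):
    print(row, f"complexity = {value:.3f}")
print(f"average = {average_complexity:.3f}")
```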


Proportionate Variance Contributions 


            Orthogonal   Oblique
            Direct       Direct   Joint   Total
Factor 1    .543
Factor 2    .457


Factors make different proportionate contributions to the common or 
explained variance. The greater the proportionate variance 
contribution of a factor, the more important the factor is in terms of 
explaining the intercorrelations of the variables. The direct 
proportionate contribution of a factor, regardless of whether you are 
looking at an orthogonal or an oblique solution, represents the 
proportion of the common variance that the factor accounts for 
independent of the other factors. The joint proportionate 
contribution of a factor is only defined for an oblique solution since 
it deals with shared variance or variance that is common to more 
than one factor. At times, you may note trivial negative joint 
contributions. Such negative contributions tend to be negligible. It is 
the positive joint proportionate contribution (such as the .20 in the 
eight physical variables example) that is of interest. 


This suggests that 20% of the common variance may be attributable 
to the covariation of the two factors. When looking at the 
proportionate variance contributions of the oblique factors relative to 
the orthogonal factors, we see that for the eight physical variables 
data, the bone factor accounts for a greater proportion of the 
common variance than the flesh factor, but it accounts for 
substantially more variance (64%) as an oblique factor. Of that 64%, 
34% is accounted for independently of the other factor. 
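
For an orthogonal solution, the direct contributions can be sketched as each factor's share of the summed squared loadings. This is an assumption about the computation, with hypothetical loadings; the oblique direct, joint and total contributions are not reproduced here.

```python
# Hypothetical orthogonal (varimax) loadings: rows are variables, columns factors.
loadings = [[0.9, 0.2], [0.8, 0.3], [0.2, 0.8], [0.3, 0.7]]

n_factors = len(loadings[0])

# Variance accounted for by each factor: column sum of squared loadings.
column_ss = [sum(row[k] ** 2 for row in loadings) for k in range(n_factors)]
common_variance = sum(column_ss)

# Direct proportionate contribution of each factor to the common variance.
direct = [ss / common_variance for ss in column_ss]

for k, share in enumerate(direct, start=1):
    print(f"Factor {k}: direct contribution = {share:.3f}")
```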


Factor Score Weights for Oblique Transformation Solution-Orthotran/Varimax 


[Table: Factor Score Weights for Oblique Transformation Solution. Columns: Factor 1, Factor 2. Rows: height, arm span, forearm leng..., lower leg len..., weight, bitrochanter..., chest girth, chest width.]


The final display provided by the StatView II factor analysis is a 
display of factor score weights. If you wish to convert the original 
variable observations to standardized factor scores (for the eight 
physical variables data, a flesh score and a bone score), the columns 
of the factor score weight matrix may be thought of as the 
standardized regression weights for converting the eight physical 
variables in standard score format to two standard scores, one for 
flesh and one for bone. If the oblique factor score weights are used, 
then the intercorrelation of the factor scores is exactly 
.503, as defined by the factor intercorrelation matrix. 


Factor Score Weights for Orthogonal Transformation Solution-Varimax 


[Table: Factor Score Weights for Orthogonal Transformation Solution. Columns: Factor 1, Factor 2. Rows: height, arm span, forearm leng..., lower leg len..., weight, bitrochanter..., chest girth, chest width.]





If the orthogonal factor score weights are used, then the 
intercorrelation of the factor scores is exactly .00. Whether you use 
orthogonal or oblique factor scores is really a matter of personal 
preference. If you want to interpret an oblique solution, then the 
oblique factor scores should be used. 
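
Applying factor score weights amounts to standardizing each variable and forming a weighted sum per factor. The sketch below shows that step under stated assumptions: the data, the weights and the function names are hypothetical, and the sample standard deviation (n - 1 divisor) is assumed for the standard scores.

```python
import math

def standardize(column):
    """Convert raw observations to standard scores (mean 0, sample SD 1)."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]

def factor_scores(data_columns, weights):
    """Apply factor score weights (one row per variable) to standardized data."""
    z_columns = [standardize(col) for col in data_columns]
    n_cases = len(data_columns[0])
    n_factors = len(weights[0])
    return [
        [sum(z_columns[v][i] * weights[v][f] for v in range(len(weights)))
         for f in range(n_factors)]
        for i in range(n_cases)
    ]

# Hypothetical data: two variables observed on four cases.
data_columns = [[170.0, 160.0, 180.0, 175.0], [65.0, 55.0, 80.0, 72.0]]
# Hypothetical factor score weights: rows are variables, columns are factors.
weights = [[0.6, -0.1], [-0.1, 0.7]]
scores = factor_scores(data_columns, weights)

for case in scores:
    print([round(value, 3) for value in case])
```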


Graphic Views 

StatView II provides three plots: those associated with the unrotated 
factor solutions, those associated with the orthogonal solution and 
those associated with the oblique solution. Within any particular set 
of plots, all pairwise factor plots are presented. 

Unrotated Solution 


*> Choose Scattergram from the View menu. 


[Figure: Unrotated Orthogonal Plot: Factor 1 vs. Factor 2]


The plot of the unrotated solution allows you to make a quick 
judgment regarding the potential simple structure of the factor 
solution. For the eight physical variables problem, two distinct 
clusters of points are apparent in the unrotated plot. An ideal factor 
solution for the variables, from a simple structure perspective, would 
have one axis passing through the cluster of variables 1 through 4 in 
the upper right hand quadrant, and the other axis passing through the 
other cluster. If the data were under-factored (which is not possible 
with the eight physical variables), then you might see points 
scattered throughout all four quadrants with no definitive clusters of 
points. If the data were over-factored, you would see many points 
near the point of intersection of the two axes, and perhaps one or two 
points defining a cluster. 


Orthogonally Rotated Solution 


*> Click the down arrow in the scroll bar. 


[Figure: Rotated Orthogonal Plot: Factor 1 vs. Factor 2]


The plot of the orthogonally rotated solution should show the axes 
either near the clusters or passing through them. If the axes pass 
through the clusters, then an orthogonal solution, uncorrelated 
factors, is appropriate for the data. If the axes are near, but not 
through, the clusters (as they are for the eight physical variables), 
then an oblique solution is most appropriate. A dataset that has been 
over-factored is apparent in the plot of the orthogonal solution since 
most of the plotted variables will be right at the origin and small 
clusters of a few variables will be near an axis. A plot of an 
under-factored orthogonal solution will be similar to the plot of the 
unrotated solution. 


Oblique Solution 


*> Click the down arrow in the scroll bar. 


[Figure: Transformed Oblique Plot: Factor 1 vs. Factor 2, with the Primary 1 and Primary 2 axes drawn]


The plot of the oblique solution should show the oblique axes, 
primary axes, passing through the clusters of points as they do for 
the eight physical variables. Notice that the plotted primary axes are 
not at right angles. This is because the axes are correlated. In this 
example, the simple structure of the oblique solution is quite good. 
This is because the primary axes pass directly through the clusters. In 
a case where the orthogonal solution passes axes through the clusters, 
the oblique solution and the orthogonal solution 
are identical and the factor intercorrelations are 0. 


ANOVA 


StatView II can calculate an Analysis of Variance for Factorial or 
Repeated Measures Models. Factorial experiments may have up to 
16 grouping factors and need not be balanced (that is, cell frequency 
does not have to be the same throughout the model). In fact, 
StatView II can handle factorial models with missing cells if the 
model is connected. All grouping factors must be fixed factors for 
results to be meaningful. 


The following restrictions hold for StatView II ANOVA. 


• There can be no more than one repeated measures factor, also 
referred to as a within factor. Subjects cannot be measured 
more than once unless the measure is the single repeated 
factor. 


• StatView II only handles complete factorial designs with no 
replications (no split-plots, no Latin squares, and no 
analysis of covariance). 


• Repeated measure experiments with one within-factor and up 
to 15 grouping factors can be analyzed. 


• Models with a single grouping factor and one within-factor 
can have unequal cell size; however, models with 2 or more 
grouping factors and one within-factor must be balanced. 


• For unbalanced data, a Yates Weighted Means Squares 
analysis is performed. 


• The grouping factor for the Repeated Measure Models must 
be fixed. 


• Multiple comparison tests are only available for the single 
factor analyses. 


• StatView II ANOVA expects all data for a subject to be in a 
single row. You can recognize a repeated factor because you 
will have multiple columns for the response, or Y, variables. 


If you are not sure whether StatView II can solve your model design, 
take an example of this design from a textbook, enter the values 
into StatView II, and see whether it produces the proper solution. 


Multiple comparison tests (Scheffé, Fisher's PLSD, Dunnett t) are 
performed for all single factor models. The significance level for the 
Dunnett t is not reported by StatView II. In order to interpret the 
Dunnett t you must check a Dunnett table. Such tables are found in 
most statistics books that treat the topic of multiple comparisons (see 
Winer p. 874). Please note that StatView II computes significance 
levels for Fisher's PLSD even when the overall significance level for 
the factor does not meet your significance level. Do not report 
significant mean differences unless your overall F test is significant. 


*> Choose ANOVA from the Compare menu. The following dialog 
box is displayed: 


[Dialog box: Analysis of Variance Parameters. Text entry: Multi-comparison significance level for one factor ANOVA. Experiment Type: Factorial or Repeated Measures.]





If you are analyzing a single factor model, enter the significance 
level for the multiple comparison tests in the text entry rectangle. 
StatView II will default to a significance level of 95%. 


Choose whether the experiment type is factorial or repeated measure. 


Specify the experiment design through X and Y variable 
assignments. The number of grouping factors is determined by the 
number of X variables. The X variables must be Category or Integer 
columns. The dependent variable is specified via the Y variable. For 
repeated measure experiments, the within-factor is specified with 
multiple Y variables. 


To facilitate your understanding of StatView II ANOVA, the 
examples below illustrate different structural models. This 
discussion uses several datasets: 

• Lipid Data 

• Winer 2 Factor Balanced (Winer, p. 437) 

• Afifi & Azen 2 Factor Unbalanced (Afifi and Azen, p. 166) 

• Winer One Factor Repeated Measures (Winer, p. 268) 

• Winer Three Factor Repeated Measures (Winer, p. 525) 


Single Factor Factorial, Non-Repeated Measure 

Since this example is computing a single factor ANOVA, assign one 
X variable to the single grouping factor. The Y variable specifies the 
dependent data column. 


A single factor factorial design ANOVA is a OneXOneY statistic. A 
result is computed for each dependent Y variable assigned. 


*> Open the Lipid Data dataset. 


The figure below illustrates the dataset layout for this Single Factor 
Factorial, Non-Repeated Measure model. 


[Figure: dataset layout. The independent variable, Heart Attack (X1), holds the group label for each subject; the dependent variable, Height (Y1), holds the Height value for each subject. One subject per row, subject 1 through subject 95, with subjects 1 to 64 in the group none and subjects 80 to 95 in the group over 60.]





The figure displays an analysis of four groups on a single dependent variable. 
The dependent variable, Height, is represented as a single data 
column and should be assigned as Y1. A second column should be 
the grouping column, in this example Heart Attack. This grouping 
column will, by row, note the group to which the associated 
dependent variable belongs. The grouping column should be 
assigned as X1. 

*> Assign a Y to Height. Assign an X to Heart Attack. 

*> Choose ANOVA from the Compare menu. 


*> Enter 95 as the significance level for post hoc means 
comparisons. Note: This corresponds to an alpha level of .05. 


*> Click Factorial. 
*> Click OK. 


*> Choose Table from the View menu. 


One Factor ANOVA   X1: Heart History   Y1: Height 


Analysis of Variance Table 

Source:          DF:   Sum Squares:   Mean Square:   F-test:
Between groups         85.067         28.356         1.674
Within groups                         16.941         p = .1782
Total

Model II estimate of between component variance = 3.805 

The view title specifies the experiment design and the X and Y 
variables. The ANOVA table includes: 


• the between-groups degrees of freedom, sum of squares, 
mean square 


• the within-groups degrees of freedom, sum of squares, mean 
square 


• the total degrees of freedom, sum of squares 


• F statistic and probability 


For a single factor, the Model II estimate of between component 
variance is computed. This is an unbiased estimate for the variance 
of the population of differential effects due to the single factor (Afifi 
& Azen, p. 215). 


Note that the ANOVA is not significant at the .05 level. If we had 
chosen the .20 level, the ANOVA would have been significant. 
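
The quantities in the single factor table can be sketched as follows. The data are hypothetical, the function name is invented for illustration, and the Model II estimate uses the usual textbook form (between mean square minus within mean square, divided by an average group size n0); whether StatView II computes its reported value in exactly this way is an assumption. The probability value needs an F distribution and is left out.

```python
def one_way_anova(groups):
    """Single factor ANOVA on a list of groups (each a list of observations)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within

    # Model II (random effects) estimate of the between component variance,
    # using the usual "average" group size n0 for unequal groups (assumed form).
    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (k - 1)
    between_component = (ms_between - ms_within) / n0

    return {
        "df_between": df_between, "df_within": df_within,
        "ss_between": ss_between, "ss_within": ss_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "f": ms_between / ms_within,
        "between_component": between_component,
    }

# Hypothetical observations for three groups of unequal size.
groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0, 9.0], [9.0, 10.0, 11.0]]
result = one_way_anova(groups)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```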


*> Click the down arrow in the scroll bar to display the table of cell 
means. 


One Factor ANOVA   X1: Heart History   Y1: Height 





For each group the following summary statistics are provided: 
• name 
• count of cases 
• mean 
• standard deviation 
• standard error of the mean 


If there are more than five groups, the table is continued on 
successive pages. 


This table should always be examined to determine the number of 
data values in each cell of the design as well as cell means and 
standard deviations. 


*> Click the down arrow in the scroll bar to display the table of post 
hoc means tests. 


One Factor ANOVA   X1: Heart History   Y1: Height 


[Table: post hoc means comparisons. Columns: Comparison, Mean Diff., Fisher PLSD, Scheffe F-test, Dunnett t. One row for each pair of groups.]


* Significant at 95% 






Comparisons between treatment means are provided in the last table. 
For each treatment comparison the following information is 
provided: 


• difference between the means 


• Fisher's PLSD test 


• Scheffé F-test 


• Dunnett t-test 


If the Fisher and Scheffé tests are significant at the level entered in the 
dialog box, an asterisk (*) appears by the comparison value. No 
significance is computed for the Dunnett test. 


Notice that the none vs. < 50 difference is marked significant in the 
Fisher PLSD column. You must not interpret this as a significant 
difference because the F test from page one was not significant. The 
P in PLSD stands for protected. The protection results from not 
interpreting the PLSD column unless the F test is significant. If you 
go back to the ANOVA choice from the Compare menu and select 
80 for the Multi-comparison significance level, the PLSD values will 
correspond to an alpha of .20. While almost no one would ever use 
an alpha of .20, in this example the Fisher's PLSD values would now 
be appropriate to interpret. 
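
The protection rule can be written out as a short sketch. The formulas are the usual textbook forms for a least significant difference and a pairwise Scheffé F, assumed here rather than taken from StatView II. The within-groups mean square of 16.941 is the value reported for this example; the critical t, the group sizes, the mean difference and the function names are hypothetical, and the critical t must be looked up in a table for the within-groups degrees of freedom.

```python
import math

def fisher_lsd(t_critical, ms_within, n_i, n_j):
    """Least significant difference for two group means.

    t_critical is the two-tailed critical t for the within-groups degrees of
    freedom at the chosen alpha; it must be looked up (no table is built in).
    """
    return t_critical * math.sqrt(ms_within * (1.0 / n_i + 1.0 / n_j))

def scheffe_f(mean_diff, ms_within, n_i, n_j, n_groups):
    """Scheffe F for one pairwise comparison among n_groups groups."""
    return mean_diff ** 2 / (ms_within * (1.0 / n_i + 1.0 / n_j) * (n_groups - 1))

def protected_significant(mean_diff, lsd, overall_f_significant):
    """Fisher's protection: a difference counts only if the overall F test passed."""
    return overall_f_significant and abs(mean_diff) >= lsd

# Within-groups mean square reported for this example; other values hypothetical.
ms_within = 16.941
lsd = fisher_lsd(t_critical=1.99, ms_within=ms_within, n_i=64, n_j=8)
flagged = protected_significant(mean_diff=3.5, lsd=lsd, overall_f_significant=False)

print(f"LSD = {lsd:.3f}, interpret as significant: {flagged}")
```

The mean difference exceeds the least significant difference in this sketch, yet it is not reported as significant because the overall F test failed.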


There are no graphic views available. 


Two Factor Factorial, Non-Repeated Measures - Balanced Model 


When analyzing a two factor ANOVA, assign X variables to each of 
the two grouping factors. The Y variable specifies the dependent 
data column. A two factor factorial design is a ManyXOneY statistic. A 
result is computed for each dependent Y variable assigned. 


*> Open the Winer-Two Factor Balanced dataset. 


This data is from Winer (1971, p. 436). Winer describes the 
experiment as follows: The experiment evaluates the effectiveness of 
three drugs in bringing about behavioral changes in two categories of 
patients. A random sample of nine patients belonging to the first 
category (schizophrenics) is divided at random into three groups, 
with three patients in each subgroup. Each subgroup is assigned to 
one of the drug conditions. The same procedure is followed for the 
nine patients belonging to the second category (depressives). The 
rating column contains criteria ratings made on the patient before 
and after the administration of the drugs. The figure below illustrates 
the layout of the data. 


[Figure: dataset layout. Independent variables: Category of Patient and Drug. Dependent variable: Rating (Y1). Rating values for subjects, one subject per row, Subject 1 through Subject 18.]


For this example we have two independent variables. The first 
independent variable, Category of Patient, has two levels: 
schizophrenics and depressives. The second independent variable, 
Drug, has three levels: drug a, drug b, drug c. This model will 
require three columns of data: two grouping columns and a 
dependent variable column. There is a unique group for each 
crossing of the levels of the independent variables. This is a balanced 
design because all groups have equal sample sizes. If any pair of 
groups had unequal sample sizes, it would be an unbalanced design. 


The first independent variable column, X1, will comprise two 
unique entries, one for each of the two levels. The second independent 
variable column, X2, will comprise three unique values, one for 
each of the three levels. The third column, Y1, will be the dependent 
variable column. Reading row-wise for a subject, the first entry will 
denote the individual's group on independent variable 1; the second 
entry for the row will denote the individual's group on independent 
variable 2. The third entry for the row will denote the individual's 
observed value on the dependent variable. 


*> Assign a Y to Rating. Assign an X to Category of Patient and 
Drug. 


*> Choose ANOVA from the Compare menu. 

*> Click Factorial. 

*> Click OK. 

*> Choose Table from the View menu. 

Two factor and higher ANOVA calculations present a dialog box 
which specifies the number of factors in the model and indicates the 
progress of the computation. Note that multi-factor ANOVAs often 
take a long time to compute. 


Anova table for a 2-factor Analysis of Variance on Y1: Rating 


Source:                   df:   Sum of Squares:   Mean Square:   F-test:   P value:
Category of Patient (A)   1     18                18             2.038     .1789
Drug (B)
AB                        2     144               72             8.151     .0058
Error


There were no missing cells found. 


The view title names the experiment design and the Y variable 
analyzed. The note on the bottom of the page indicates whether any 
missing cells (cells with no observed values) were found. The 
ANOVA table includes: 


• for the main effect of Factor A: the degrees of freedom, sum 
of squares, mean square, F-ratio and probability value 


• for the main effect of Factor B: the degrees of freedom, sum 
of squares, mean square, F-ratio and probability value 


• for AxB interaction: the degrees of freedom, sum of squares, 
mean square, F-ratio and probability value 


• for Model Error: the degrees of freedom, sum of squares, 
mean square 
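
The entries listed above can be computed for a balanced design as in the following sketch. The function is a generic textbook decomposition, not StatView II's routine. The ratings are illustrative values (an assumption, not read from the dataset file) chosen so that the category and interaction sums of squares, 18 and 144, and the interaction F of 8.151 agree with the values legible in the table for this example.

```python
def two_way_balanced_anova(cells):
    """Two factor ANOVA for a balanced design.

    cells[i][j] is the list of observations at level i of factor A and
    level j of factor B; every cell must hold the same number of values.
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    if any(len(cell) != n for row in cells for cell in row):
        raise ValueError("design is not balanced")

    values = [x for row in cells for cell in row for x in cell]
    grand = sum(values) / len(values)
    cell_mean = [[sum(cell) / n for cell in row] for row in cells]
    a_mean = [sum(row) / b for row in cell_mean]
    b_mean = [sum(cell_mean[i][j] for i in range(a)) / a for j in range(b)]

    ss_a = b * n * sum((m - grand) ** 2 for m in a_mean)
    ss_b = a * n * sum((m - grand) ** 2 for m in b_mean)
    ss_cells = n * sum((m - grand) ** 2 for row in cell_mean for m in row)
    ss_ab = ss_cells - ss_a - ss_b
    ss_error = sum((x - grand) ** 2 for x in values) - ss_cells

    df = {"A": a - 1, "B": b - 1, "AB": (a - 1) * (b - 1), "Error": a * b * (n - 1)}
    ss = {"A": ss_a, "B": ss_b, "AB": ss_ab, "Error": ss_error}
    ms = {key: ss[key] / df[key] for key in df}
    f = {key: ms[key] / ms["Error"] for key in ("A", "B", "AB")}
    return df, ss, ms, f

# Illustrative ratings (an assumption, not read from the dataset file).
cells = [
    [[8, 4, 0], [10, 8, 6], [8, 6, 4]],      # first patient category, drugs a to c
    [[14, 10, 6], [4, 2, 0], [15, 12, 9]],   # second patient category, drugs a to c
]
df, ss, ms, f = two_way_balanced_anova(cells)
for source in ("A", "B", "AB", "Error"):
    print(source, df[source], round(ss[source], 3), round(ms[source], 3))
```

Probability values are omitted because they require an F distribution, which is outside the scope of this sketch.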


*> Click the down arrow in the scroll bar. 


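The AB Incidence table on Y1: Rating 


[Table: AB incidence table. Rows: Category of Patient (schizophre..., depressives) and totals. Columns: the three drug levels and totals. Each cell shows the cell count above the cell mean.]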


On the pages following the ANOVA table, StatView II provides the 
incidence table. The incidence table details the count and mean of 
each cell found in the analysis. The cell count is the top value in each 
cell; the cell mean is the lower value. Each row and column of the 
incidence table is provided with that row or column's total count and 
mean. This particular analysis contains six cells. Note that factor A 
(X1) is always the vertical component of the incidence table and any 
other factors are displayed horizontally. 
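
An incidence table can be sketched from data laid out one subject per row. The rows and function names below are hypothetical; the same cell counts also show whether the design is balanced.

```python
from collections import defaultdict

def incidence_table(rows):
    """Count and mean of the dependent variable for each (A, B) cell.

    rows is a list of (level_of_A, level_of_B, y) tuples, one per subject.
    """
    cells = defaultdict(list)
    for level_a, level_b, y in rows:
        cells[(level_a, level_b)].append(y)
    return {key: (len(ys), sum(ys) / len(ys)) for key, ys in cells.items()}

def is_balanced(table):
    """True when every observed cell holds the same number of cases."""
    return len({count for count, _ in table.values()}) == 1

# Hypothetical rows: (category of patient, drug, rating).
rows = [
    ("schizophrenic", "drug a", 8), ("schizophrenic", "drug a", 4),
    ("schizophrenic", "drug b", 10), ("schizophrenic", "drug b", 6),
    ("depressive", "drug a", 14), ("depressive", "drug a", 10),
    ("depressive", "drug b", 4), ("depressive", "drug b", 2),
]
table = incidence_table(rows)

for (level_a, level_b), (count, mean) in table.items():
    print(f"{level_a:14s} {level_b:7s} count={count} mean={mean:.2f}")
print("balanced:", is_balanced(table))
```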












There are no graphic views available. 


Two Factor Factorial, Non-Repeated Measures - Unbalanced Model 


The layout for this model is the same as in the preceding Two Factor 
Factorial, Non-Repeated Measures example. 


*> Open the Afifi & Azen - Two Factor Unbalanced dataset. 


This data is from Afifi & Azen (1972, p. 166). The experiment 
evaluates the increase in systolic blood pressure resulting from four 
different treatments for three different diseases. 


The two factors in the experiment are treatment and disease. The first 
factor in the experiment, factor A, is specified by the column 
Treatment. This is a category column containing four groups: Drug 
A, Drug B, Drug C, and Drug D. The second factor in the 
experiment, factor B, is specified by the Disease column. This is a 
category column containing three groups: Disease 1, Disease 2, and 
Disease 3. The blood pressure data is in the column Systolic 
Pressure. 


This design is unbalanced because the pairs of groups have unequal 
sample sizes. 


*> Assign a Y to Systolic Pressure. Assign an X to Treatment and 
Disease. 


*> Choose ANOVA from the Compare menu. 

*> Click Factorial. 

*> Click OK. 

*> Choose Table from the View menu. 


Unbalanced models take longer to evaluate than balanced 
models. A dialog box informs you of the progress of the analysis. 


Anova table for a 2-factor Analysis of Variance on Y1: Systolic Pressure 


Source:         df:   Sum of Squares:   Mean Square:   F-test:   P value:
Treatment (A)   3     2997.472          999.157                  .0001
Disease (B)     2     415.873           207.937
AB              6     707.266           117.878
Error           46    5080.817          110.453


There were no missing cells found. 


The view title specifies the experiment design and the Y variable
analyzed. The ANOVA and incidence tables have the same format as
in the balanced two-level ANOVA.


The AB Incidence table on Y1: Systolic Pressure


Treatment:   Disease 1   Disease 2   Disease 3   Totals:
Drug A           6           4           5          15
               29.333      28.25       20.4       26.067
[Rows for Drug B, Drug C, and Drug D are illegible in the source.]
Totals:         19          19          20          58
               22.789      18.211      15.8       18.879


There are no graphic views available. 











Single Factor Factorial — One Repeated Measure


Experiments in which the same subject is observed under each
treatment are repeated measures experiments. In StatView II, the
single factor repeated measures ANOVA is set up differently from the
multiple factor repeated measures models. For single factor models,
each treatment is assigned an X variable. It is a ManyX statistic: one
result is computed using all X variables.


*> Open the Winer 1 Factor Repeated Measures dataset. 

This data is from Winer, p. 268. Winer describes the experiment as
follows: The experiment studies the effects of four drugs upon
reaction time to a series of tasks. Each subject was observed under
each of the drugs. Each column reflects the mean reaction time of a
subject to the series of tasks.


                                     Dependent variables
                                Drug 1   Drug 2   Drug 3   Drug 4
                                  X1       X2       X3       X4
Four values in each row,   Subject 1
one subject per row.       Subject 2
                           Subject 3
                           Subject 4
                           Subject 5


Note: For the single factor model only, the repeated measures factor
is indicated by X variables. This example has four X variables
assigned, indicating each subject's four measures on the dependent
variable. Each subject will have four row entries involved in the
analysis.

*> Assign an X to each of the four drug columns. 

*> Choose ANOVA from the Compare menu. 

*> Enter 99 as the significance level.

*> Click Repeated Measures. 

*> Click OK. 


*> Choose Table from the View menu.


One Factor ANOVA-Repeated Measures for X1 ... X4


Source:             df:   Sum of Squares:   Mean Square:   F-test:   P value:
Between subjects     4        680.8            170.2        3.148     .0458
Within subjects     15        811               54.067
  treatments         3        698.2            232.733     24.759     .0001
  residual          12        112.8              9.4
Total               19       1491.8


Reliability Estimates for-  All treatments: .682   Single Treatment: .349





The view title specifies the experiment design and the number of X
variables. The ANOVA table includes:


* the between-subjects degrees of freedom, sum of squares,
mean square

* the within-subjects degrees of freedom, sum of squares, mean
square

* the treatments degrees of freedom, sum of squares, mean
square, F value and probability

* the residual degrees of freedom, sum of squares, mean square

* the total degrees of freedom, sum of squares


Reliability estimates are computed for the mean of all treatments and 
for a single treatment (Winer, p. 283). 
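
The two reliability estimates can be recomputed from the between-subjects and within-subjects mean squares in the ANOVA table. The Python sketch below uses the usual analysis-of-variance reliability formulas; that StatView II uses exactly these formulas is an assumption, although they do reproduce the printed values of .682 and .349. The variable names are illustrative.

```python
ms_between = 170.2   # between-subjects mean square from the table
ms_within = 54.067   # within-subjects mean square from the table
k = 4                # number of treatments (Drug 1 through Drug 4)

# Reliability of the mean of all k treatments (assumed formula).
rel_all = 1 - ms_within / ms_between

# Reliability of a single treatment (assumed formula).
rel_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

print(f"All treatments:   {rel_all:.3f}")
print(f"Single treatment: {rel_single:.3f}")
```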


*> Click the down arrow in the scroll bar.


One Factor ANOVA-Repeated Measures for X1 ... X4


Group : Count: Mean: Std. Dev.: Std. Error: 





For each treatment the following summary statistics are provided:

* name
* count of cases
* mean
* standard deviation
* standard error of the mean

Note that the mean for Drug 3 is significantly lower than the other
means. This indicates that Drug 3 is associated with the fastest
reaction.

If there are more than five treatments this table is continued on
successive pages.


*> Click the down arrow in the scroll bar to display the table of post 
hoc means tests. 


One Factor ANOVA-Repeated Measures for X1 ... X4


Comparison:   Mean Diff.:   Fisher PLSD:   Scheffe F-test:   Dunnett t:

[Comparison rows are illegible in the source.]

* Significant at 99%











Comparisons between treatment means are provided in the last table.
For each treatment comparison the following information is
provided:


* difference between the means
* Fisher's PLSD test
* Scheffé F-test
* Dunnett t-test


If the Fisher and Scheffé tests are significant at the level entered in the
dialog box, an asterisk (*) will appear by the comparison value. No
significance is computed for the Dunnett test. See the discussion of
this table for the single factor - non-repeated measures example for
more information.


There are no graphic views available. 


Three Factor Factorial — One Repeated Measure 


Multiway repeated measures models are experiments with repeated
measures on one factor (the within-subjects factor) and one or more
grouping factors. Grouping factors refer to effects that occur
between groups, while within-subjects factors refer to effects
measured by differences within subjects.


For models containing two or more factors, the repeated measure 
(within-subjects factor) is specified by two or more dependent Y 
variables. The grouping factors are specified by X variables. If there 
are two or more grouping factors the data must be balanced. 


A multi-factor repeated measure design is a ManyXManyY statistic. 
One result is calculated from all the X-Y variables. 


*> Open the Winer - Three Factor Repeated Measures dataset.

This data is from Winer (1971, p. 564). The experiment evaluates the
effect of anxiety and muscular tension on a learning task. Each
factor has two levels. The experiment is repeated four times for each
subject (Trial 1 through Trial 4) with the number of errors recorded.


  Independent variables              Dependent variables
  Anxiety   Tension        Trial 1   Trial 2   Trial 3   Trial 4
    X1        X2             Y1        Y2        Y3        Y4

Four values in each row, one subject per row (Subject 1 through
Subject 12).


This model has three independent variables, one of which is a
repeated measures factor. There are only three factors associated
with this model, but there will be six columns involved in the
analysis, two X columns and four Y columns. The first X column
will be associated with independent variable Anxiety, X1, and will
have two unique values, Low and High, one for each level of the
variable. The second X column will be associated with independent
variable Tension, X2, and will have two unique values, None and
High, one for each level of the variable. The third independent
variable (the repeated measures factor) will have four Y columns
associated with it, one for each of a subject's four measures on the
dependent variable: Y1, Y2, Y3, and Y4. Each subject will have six
row entries involved in the analysis.





*> Assign Y variables to the four Trial columns. Assign an X to 
Anxiety and Tension. 


*> Choose ANOVA from the Compare menu. 
*> Click Repeated Measures. 

*> Click OK. 

*> Choose Table from the View menu.


The dialog box keeps you up to date on the progress of the analysis.


Anova table for a 3-factor repeated measures Anova.


Source:                   df:   Sum of Squares:   Mean Square:   F-test:   P value:
Anxiety (A)                1        10.088                                   .3817
Tension (B)
AB
subjects w. groups
C                          3       991.5             330.5        152.051    .0001
AC
BC
ABC
C x subjects w. groups    24        52.167             2.174


There were no missing cells found. 













The view title specifies the experiment design.
The ANOVA table contains:


* the sum of squares, mean square, degrees of freedom, F-test
and p value for all the between-subjects variation (in this
example, these are A, B, and the AB interaction)


* the Subjects within Groups sum of squares, mean square, and
degrees of freedom. Note that this line indicates the error for
the between factors A and B and serves as the denominator for
the between-subjects F-tests


* the sum of squares, mean square, degrees of freedom, F-test
and p value for all the within-subjects variation (in this
example, these are C, AC, BC, and ABC)


* the C x Subjects within Groups sum of squares, mean square,
and degrees of freedom. Note that this line indicates the error
for the within factors and serves as the denominator for the
within-subjects F-tests.


The note on the bottom of the page indicates whether or not any 
missing cells (cells with no observed values) were found. 


*> Click the down arrow in the scroll bar. 


Just as in the factorial model, the ANOVA table is followed by the
incidence table. Since this example has three factors, StatView II
displays four different incidence tables to give a comprehensive
breakdown of the data. The tables given are the AB, AC, BC and
ABC cell counts and means. Just as with the factorial model, the first
factor alphabetically is the vertical component of the tables. For
tables AB, AC and ABC, this is factor A (X1) and for table BC this is
factor B (X2). Each table is clearly labelled.


There are no graphic views available.


Contingency Table


A Contingency table computes the following:


* The Chi-Square Goodness of Fit comparing the observed
frequency of a single group sample with the expected
frequency


* Formation of Contingency (frequency, cross-tabulation)
Tables with summarizing statistics. The statistics include:
Total Chi-Square, G Statistic, Contingency Coefficient,
Cramer's V, Phi (2x2 only), Chi-Square with Continuity
Correction (2x2 only). The Contingency tables can be
tabulated from coded data or input as previously tabulated
data.


StatView II uses a relatively new procedure for identifying the cells
of a contingency table responsible for a significant chi square,
referred to as the post hoc cell contributions. For each cell of a
contingency table associated with a chi square, an adjusted residual is
computed. This adjusted residual is determined from a standardized
residual and is approximately normally distributed with a mean of 0
and a standard deviation of 1. Thus, an adjusted residual of 1.96 or
more in absolute value suggests that the deviation of the cell's
observed frequency from its expected frequency is significant at the
.05 level. Usually a significant cell only accompanies a significant
chi square.
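
As an illustration of the post hoc cell contributions, the Python sketch below computes an adjusted residual for every cell of a small table. It assumes the common definition in which the standardized residual (observed minus expected, over the square root of expected) is divided by the square root of (1 - row proportion)(1 - column proportion); the manual does not give the exact formula StatView II uses. The function name and the 2x2 counts are made up for the example.

```python
import math


def adjusted_residuals(table):
    """Adjusted residual for each cell of a table of observed counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            standardized = (observed - expected) / math.sqrt(expected)
            # Assumed variance adjustment for the standardized residual.
            variance = (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            out_row.append(standardized / math.sqrt(variance))
        result.append(out_row)
    return result


counts = [[30, 10], [20, 40]]  # hypothetical observed frequencies
residuals = adjusted_residuals(counts)
for row in residuals:
    print([round(value, 3) for value in row])
```

Cells whose adjusted residual exceeds 1.96 in absolute value are the ones that would be flagged at the .05 level.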


*> Choose Contingency Table from the Compare menu, and the 
Contingency Table dialog box appears. 


Select Test:
  ( ) Chi-Square (One Group)
  (o) Contingency Table analysis


---Contingency Table Information---

The contingency table size is determined by
your data.

Select the source for contingency table data
  ( ) Coded Raw Data (Y-rows, X-columns)
  (o) Tabulated Data (from data window)

Additional Frequency Tables for display:
  [x] Row %        [ ] Expected Values
  [x] Column %     [x] Post-hoc Contributions


The dialog box allows you to select which type of test you are
performing.


Choose Chi-Square (One Group) if you are comparing the 
frequency observed in a sample with the expected frequency derived 
from a theoretical model. 


Choose Contingency Table if you are analyzing two variables 
which are classified into a number of categories or attributes. If you 
are analyzing Contingency table data you need to specify additional 
information. 


Data may be in coded raw form, in which case the program tabulates
the contingency table for you. Alternatively, data may already be in
tabulated form in the dataset, in which case the program reads the
tabulated data.


All contingency table analyses include a summary page of statistics 
and the observed frequency table. You may also choose to display 
four additional tables: 


* row percents
* column percents
* expected values
* post hoc cell contributions displaying the adjusted residual
for each cell in the contingency table.


Chi-Square — One Group 


The Chi-Square test compares the observed frequencies, located in
an X column, to expected frequencies, located in a Y column. It is a
OneXOneY statistic. If there is a single X variable and multiple Y
variables assigned, a result is calculated for each Y variable. If there
is a single Y variable and multiple X variables assigned, a result is
calculated for each X variable. If there are multiple X and Y
variables assigned, a result is calculated for each Xi-Yi pair
(matching subscripts).


For this statistic, we need to create a new dataset. Assume that we
expect the mean male cholesterol value for the patients in Lipid Data
to be 180 and the mean female cholesterol value to be 190. The
observed mean value for the male patients is actually 190.085 while
the observed mean value for the female patients is 194.625. We
create a new dataset with two columns, both type Real. The first
column contains the observed values: 190.085 and 194.625. The
second column contains the expected values: 180 and 190.


*> Create a new dataset and enter the four values as described
above.


*> Assign an X to the first column, the observed values; assign a Y
to the second column, the expected values.

*> Choose Contingency Table from the Compare menu.

*> Click Chi-Square (One Group).

*> Click OK.

*> Choose Table from the View menu.


One Group Chi-Square X1: Observed-Cholesterol Y1: Expected-Cholest...


DF:   Chi-Square:   Probability:


There are no graphic views. 
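
The one group Chi-Square sums, over the rows of the dataset, the squared difference between the observed and expected values divided by the expected value. The Python sketch below applies that textbook formula to the observed values and to the expected values of 180 and 190 entered in the second column. Taking the degrees of freedom as the number of rows minus one is an assumption about how StatView II reports DF.

```python
observed = [190.085, 194.625]  # X column: observed mean cholesterol values
expected = [180.0, 190.0]      # Y column: expected values

# Goodness-of-fit chi-square: sum of (O - E)^2 / E over all rows.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumed degrees of freedom: number of categories minus one.
df = len(observed) - 1

print(f"DF: {df}  Chi-Square: {chi_square:.3f}")
```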


Contingency Table 


When a contingency table is to be tabulated by StatView II, select
Coded raw data and designate the row(s) with a Y variable and the
column(s) with an X variable. Both variables must be either Category
or Integer columns.


It is a OneXOneY statistic. If there is a single X variable and
multiple Y variables assigned, a result is calculated for each Y
variable. If there is a single Y variable and multiple X variables
assigned, a result is calculated for each X variable. If there are
multiple X and Y variables assigned, a result is calculated for each
Xi-Yi pair (matching subscripts). If the resulting table has more than
8 rows or columns, you cannot compute multiple results.

When using a previously tabulated contingency table (entered in a
dataset), select Tabulated Data and assign X variables to the dataset
columns which make up the contingency table columns. The rows of
the table are the included rows of the dataset. It is a ManyX statistic.
The maximum size of a contingency table is 1600 cells.

This example determines whether a relationship exists between 
patient gender and alcohol use. The data is coded (gender as male 
and female; alcohol use as none, < 2, 2 - 6, and > 6). 

*> Open and zoom Lipid Data. 


*> Assign an X to Gender (columns of the contingency table being
tabulated) and a Y to Alcohol Use (rows of the table).

*> Choose Contingency Table from the Compare menu.

*> Select Contingency Table analysis.

*> Select Coded raw data as the source of the table.

*> Compute and display all four additional tables: Row %, Column
%, Expected Values, Post-hoc Contributions.

*> Click OK.

*> Choose Table from the View menu.


Coded Chi-Square X1: Gender Y1: Alcohol use


Summary Statistics

Total Chi-Square:           1.761   p = .6236
G Statistic:
Contingency Coefficient:
Cramer's V:


The view title specifies the X and Y variables. The first page
contains summary statistics for the contingency table:

* degrees of freedom
* total Chi-Square and probability
* G statistic
* contingency coefficient
* Cramer's V or Phi (2x2 only)
* Chi-Square with continuity correction and probability (2x2
only)
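
Two of the summary statistics can be derived directly from the total Chi-Square and the table dimensions. The Python sketch below uses the standard definitions of the contingency coefficient and Cramer's V with the values from this example (Chi-Square 1.761, 95 cases, a table of 4 alcohol-use rows by 2 gender columns). That StatView II uses these same definitions is an assumption.

```python
import math

chi_square = 1.761   # total Chi-Square from the summary page
n = 95               # total number of cases in the table
rows, cols = 4, 2    # alcohol use categories by gender categories

# Contingency coefficient: sqrt(chi2 / (chi2 + N)).
contingency_coefficient = math.sqrt(chi_square / (chi_square + n))

# Cramer's V: sqrt(chi2 / (N * (min(rows, cols) - 1))).
cramers_v = math.sqrt(chi_square / (n * (min(rows, cols) - 1)))

print(f"Contingency Coefficient: {contingency_coefficient:.3f}")
print(f"Cramer's V: {cramers_v:.3f}")
```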


The pages after the summary contain the frequency tables. The 
maximum size table displayed on one page is 8x8. Larger tables are 
displayed on additional pages. 

Page 2 contains the observed frequency table. 


Observed Frequency Table


           male   female   Totals:
none                          32
< 2                           33
2 - 6                         28
> 6                            2
Totals:     71      24        95


Page 3 contains the percents of row totals table.


Percents of Row Totals


           male     female    Totals:
Totals:   74.74%    25.26%     100%


Page 4 contains the percents of column totals table. 


Percents of Column Totals


           male    female   Totals:
none                         33.68%
Totals:    100%    100%      100%


Page 5 contains the expected values table.


Expected Values


           male   female   Totals:
none                          32
< 2                           33
2 - 6                         28
> 6                            2
Totals:     71      24        95


Page 6 contains the post-hoc cell contributions.


Post-Hoc Cell Contributions


           male   female


There are no graphic views available. 


Nonparametrics


StatView II computes the following nonparametric tests:


Two group tests:
    Mann-Whitney U
    Wilcoxon signed rank
    Spearman rank correlation coefficient
    Kendall rank correlation coefficient
    Kolmogorov-Smirnov
    Wald-Wolfowitz runs
Three or more group tests:
    Kruskal-Wallis
    Friedman


*> Choose Nonparametrics from the Compare menu. The
Nonparametrics dialog box is displayed:


Nonparametric Tests

Two group tests
  (o) Mann-Whitney U
  ( ) Wilcoxon signed-rank
  ( ) Spearman Rank correlation coefficient
  ( ) Kendall Rank correlation coefficient
  ( ) Kolmogorov-Smirnov
  ( ) Wald-Wolfowitz Runs

3 or more group tests
  ( ) Kruskal-Wallis
  ( ) Friedman





Specify the nonparametric test to calculate by clicking the 
appropriate button. 


Mann-Whitney U 


The Mann-Whitney U is the nonparametric version of the two group 
unpaired t-Test. It performs the test between two groups within a Y 
column. The groups in the Y column are specified by an X column 
which must be either a Category or Integer column. 


It is a OneXOneY statistic. If there is a single X variable and
multiple Y variables assigned, a result is calculated for each Y
variable. If there is a single Y variable and multiple X variables
assigned, a result is calculated for each X variable. If there are
multiple X and Y variables assigned, a result is calculated for each
Xi-Yi pair (matching subscripts).


This example compares the Cholesterol values of males and females
in Lipid Data.


*> Assign an X variable to the Gender column. Assign a Y variable
to the Cholesterol column.


*> Choose Nonparametrics from the Compare menu. 
*> Click Mann-Whitney U. 
*> Click OK. 


*> Choose Table from the View menu.


Mann-Whitney U X1: Gender Y1: Cholesterol


          Number:   Σ Rank:   Mean Rank:
male        71      3398.5     47.866
female      24      1161.5     48.396

The view title specifies the X and Y variables. For each group in the
Y column, the top table shows:

* the number of observations in the group
* the sum of the ranks
* the mean rank

The summary statistics table shows:

* the Mann-Whitney U statistic
* the U-prime value
* the Z value and two-tail probability

If there are tied groups this table includes:

* the Z corrected for ties and two-tail probability
* the # of tied groups


There are no graphic views available 
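
The two U values in the summary table follow from the group sizes and rank sums in the top table. The Python sketch below applies the textbook relation U = n1*n2 + n(n+1)/2 - R to each group, using the counts and the male rank sum shown above. Which of the two results StatView II labels U and which it labels U-prime is not stated in the manual, so the sketch simply names them by group.

```python
n_male, n_female = 71, 24
rank_sum_male = 3398.5  # sum of ranks for the male group, from the top table

# All ranks together sum to N(N+1)/2, so the other rank sum follows.
n_total = n_male + n_female
rank_sum_female = n_total * (n_total + 1) / 2 - rank_sum_male

# U for each group from its size and rank sum.
u_male = n_male * n_female + n_male * (n_male + 1) / 2 - rank_sum_male
u_female = n_male * n_female + n_female * (n_female + 1) / 2 - rank_sum_female

print(f"female rank sum: {rank_sum_female}")
print(f"U (male): {u_male}   U (female): {u_female}")
```

The two U values always add up to n1*n2, which gives a quick check on the arithmetic.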


Wilcoxon Signed-Rank 


The Wilcoxon Signed-Rank is the nonparametric version of the two 
group paired t-Test. It compares the values of paired X and Y 
columns. 


It is a OneXOneY statistic. If there is a single X variable and 
multiple Y variables assigned, a result is calculated for each Y 
variable. If there is a single Y variable and multiple X variables 
assigned, a result is calculated for each X variable. If there are 
multiple X and Y variables assigned, a result is calculated for each
Xi-Yi pair (matching subscripts).


This example compares the Cholesterol values of the patients in the
Lipid Data to the Chol-3yr values.


*> Assign an X variable to the Cholesterol column. Assign a Y
variable to the Chol-3yr column.

*> Choose Nonparametrics from the Compare menu.

*> Click Wilcoxon signed-rank.

*> Click OK.

*> Choose Table from the View menu.


Wilcoxon signed-rank X1: Cholesterol Y1: Chol-3yrs


           Number:   Σ Rank:   Mean Rank:
- Ranks      15       256       17.067
+ Ranks      26       605       23.269


note 2 cases eliminated for difference = 0.


Z                                 p = .0237
Z corrected for ties   -2.262     p = .0237
# tied groups


Note: 52 cases deleted with missing values.










The top table shows the following information for both the negative
and positive ranks:

* number of each rank
* sum of the ranks
* mean of the ranks

If any cases are eliminated with a difference of 0, the number is
noted.

The summary statistics show:

* Z and two-tail probability
* Z corrected for ties (if there are tied groups in the table) and
two-tail probability
* the number of tied groups (if there are tied groups in the
table)


There are no graphic views available 
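
The Z value for the signed-rank test can be sketched from the counts and rank sums in the top table. The Python example below uses the usual large-sample normal approximation without a tie correction and takes the smaller rank sum as the test statistic; both choices are assumptions about StatView II, although the result lands close to the tie-corrected Z of -2.262 shown above.

```python
import math

n_neg, n_pos = 15, 26   # number of negative and positive ranks
rank_sum_pos = 605.0    # sum of the positive ranks, from the top table

n = n_neg + n_pos                             # cases with a nonzero difference
rank_sum_neg = n * (n + 1) / 2 - rank_sum_pos  # ranks 1..n sum to n(n+1)/2

# Normal approximation: mean and standard deviation of the rank sum.
t_stat = min(rank_sum_neg, rank_sum_pos)
mean_t = n * (n + 1) / 4
sd_t = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (t_stat - mean_t) / sd_t

print(f"negative rank sum: {rank_sum_neg}")
print(f"Z (no tie correction): {z:.3f}")
```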


Spearman Rank Correlation Coefficient 


The Spearman rank correlation coefficient calculates a correlation 
based on the ranks of the values of an X and Y column. 


It is a OneXOneY statistic. If there is a single X variable and
multiple Y variables assigned, a result is calculated for each Y
variable. If there is a single Y variable and multiple X variables
assigned, a result is calculated for each X variable. If there are
multiple X and Y variables assigned, a result is calculated for each
Xi-Yi pair (matching subscripts).


This example compares the Cholesterol values with the Chol-3yr values.

*> Activate Lipid Data.

*> Assign an X variable to the Cholesterol column. Assign a Y
variable to the Chol-3yr column.

*> Choose Nonparametrics from the Compare menu.
*> Click Spearman Rank correlation coefficient. 
*> Click OK. 


*> Choose Table from the View menu. 


Spearman Corr. Coef. X1: Cholesterol Y1: Chol-3yrs


Rho corrected for ties
Z corrected for ties      4.825   p = .0001
# X tied groups: 5    # Y tied groups: 6


Note: 52 cases deleted with missing values.





The table shows the following statistics:

* number of cases
* sum of the squares of the difference of the ranks
* Spearman Rho
* Z and two-tail probability

If there are tied groups the table includes:

* Spearman Rho corrected for ties
* Z corrected for ties and two-tail probability
* number of tied groups in the X and Y variables


There are no graphic views available 
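
The uncorrected statistics in this table can be illustrated with a few lines of Python. The sketch below ranks two columns, sums the squared rank differences, and applies the textbook formulas Rho = 1 - 6*sum(d^2)/(n(n^2-1)) and Z = Rho*sqrt(n-1). It assumes there are no ties, the five pairs of values are made up, and whether StatView II computes Z in exactly this way is an assumption.

```python
import math


def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Return (sum of squared rank differences, Rho, Z) for untied data."""
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    rho = 1 - 6 * sum_d2 / (n * (n * n - 1))
    z = rho * math.sqrt(n - 1)
    return sum_d2, rho, z


# Hypothetical paired measurements, not taken from Lipid Data.
x = [210, 180, 250, 195, 230]
y = [200, 185, 240, 205, 220]
sum_d2, rho, z = spearman(x, y)
print(sum_d2, round(rho, 3), round(z, 3))
```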


Kendall Rank Correlation Coefficient 


The Kendall rank correlation coefficient calculates a correlation 
based on the ranks of the values of an X and Y column. 


It is a OneXOneY statistic. If there is a single X variable and
multiple Y variables assigned, a result is calculated for each Y
variable. If there is a single Y variable and multiple X variables
assigned, a result is calculated for each X variable. If there are
multiple X and Y variables assigned, a result is calculated for each
Xi-Yi pair (matching subscripts).


This example compares the base Cholesterol values of the patients
with their Chol-3yrs values.


*> Activate Lipid Data.

*> Assign an X variable to the Cholesterol column. Assign a Y
variable to the Chol-3yrs column.

*> Choose Nonparametrics from the Compare menu.

*> Click Kendall Rank correlation coefficient.

*> Click OK.

*> Choose Table from the View menu.


Kendall Corr. Coef. X1: Cholesterol Y1: Chol-3yrs


Tau corrected for ties
# X tied groups: 5    # Y tied groups: 6


Note: 52 cases deleted with missing values.


The view title names the X and Y variables. The table shows the
following statistics:

* number of cases
* score
* the Kendall Tau
* Z and two-tail probability

If there are tied groups the table includes:

* the Kendall Tau corrected for ties
* Z corrected for ties and two-tail probability
* the number of tied groups in the X and Y variables
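
To show how the score, Tau and Z relate, the Python sketch below counts concordant and discordant pairs for a small made-up dataset. It uses the textbook definitions for untied data: the score is concordant minus discordant pairs, Tau is the score divided by n(n-1)/2, and Z is the score divided by sqrt(n(n-1)(2n+5)/18). That StatView II follows exactly these definitions is an assumption.

```python
import math


def kendall(x, y):
    """Return (score, Tau, Z) for paired data without ties."""
    n = len(x)
    score = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            direction = (x[i] - x[j]) * (y[i] - y[j])
            if direction > 0:
                score += 1   # concordant pair
            elif direction < 0:
                score -= 1   # discordant pair
    tau = score / (n * (n - 1) / 2)
    z = score / math.sqrt(n * (n - 1) * (2 * n + 5) / 18)
    return score, tau, z


# Hypothetical paired measurements, not taken from Lipid Data.
x = [210, 180, 250, 195, 230]
y = [200, 185, 240, 205, 220]
score, tau, z = kendall(x, y)
print(score, round(tau, 3), round(z, 3))
```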


There are no graphic views available.


Kolmogorov-Smirnov


The Kolmogorov-Smirnov tests the differences between two groups 
within a Y column. The groups in the Y column are specified by an 
X column which must be either a Category or Integer column. 


It is a OneXOneY statistic. If there is a single X variable and
multiple Y variables assigned, a result is calculated for each Y
variable. If there is a single Y variable and multiple X variables
assigned, a result is calculated for each X variable. If there are
multiple X and Y variables assigned, a result is calculated for each
Xi-Yi pair (matching subscripts).


This example compares the distribution of male and female 
Cholesterol values. 


*> Assign an X variable to Gender. Assign a Y variable to 
Cholesterol. 


*> Choose Nonparametrics from the Compare menu. 
*> Click Kolmogorov-Smirnov.

*> Click OK. 

*> Choose Table from the View menu. 


Kolmogorov-Smirnov X1: Gender Y1: Cholesterol


The view title names the X and Y variables. The statistics include:













* degrees of freedom
* number of cases in each group
* maximum difference between the two cumulative
distributions
* Kolmogorov-Smirnov Chi-Square
* Z and two-tail probability
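
The maximum difference and the Chi-Square can be sketched as follows. The Python example below builds the two cumulative distributions, takes the largest absolute gap between them, and converts it with the commonly used approximation Chi-Square = 4*D^2*n1*n2/(n1+n2) on 2 degrees of freedom. Whether StatView II uses this particular approximation is an assumption, and the two small groups are made up.

```python
def ks_two_sample(group_a, group_b):
    """Return (maximum CDF difference, chi-square approximation)."""
    n1, n2 = len(group_a), len(group_b)
    d_max = 0.0
    for value in sorted(set(group_a) | set(group_b)):
        # Empirical cumulative proportions at this value.
        cdf_a = sum(1 for v in group_a if v <= value) / n1
        cdf_b = sum(1 for v in group_b if v <= value) / n2
        d_max = max(d_max, abs(cdf_a - cdf_b))
    chi_square = 4 * d_max ** 2 * n1 * n2 / (n1 + n2)
    return d_max, chi_square


# Hypothetical values for two groups.
group_a = [1, 2, 3, 4]
group_b = [3, 4, 5, 6]
d_max, chi_square = ks_two_sample(group_a, group_b)
print(d_max, chi_square)
```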


There are no graphic views available.


Wald-Wolfowitz Runs


The Wald-Wolfowitz runs tests whether two groups within a Y 
column have been drawn from the same population. The groups in 
the Y column are specified by an X column which must be either a 
Category or Integer column. 


It is a OneXOneY statistic. If there is a single X variable and
multiple Y variables assigned, a result is calculated for each Y
variable. If there is a single Y variable and multiple X variables
assigned, a result is calculated for each X variable. If there are
multiple X and Y variables assigned, a result is calculated for each
Xi-Yi pair (matching subscripts).


This example tests the male and female Cholesterol values.


*> Assign an X variable to the Gender column. Assign a Y variable
to the Cholesterol column.


*> Choose Nonparametrics from the Compare menu. 


*> Click Wald-Wolfowitz Runs.
*> Click OK. 
*> Choose Table from the View menu. 


Wald-Wolfowitz Runs X1: Gender Y1: Cholesterol


Mean                  36.874
Standard Deviation     3.648
Z                               p = .9184





The view title names the X and Y variables. The statistics include:

* number of runs
* number of cases in each group
* mean
* standard deviation
* Z and two-tail probability


There are no graphic views available. 
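
The mean and standard deviation in this table are the expected number of runs and its standard deviation when both groups come from the same population. The Python sketch below computes them with the standard large-sample formulas from the group sizes in this example (71 males and 24 females); the variable names are illustrative, and the formulas reproduce the printed 36.874 and 3.648.

```python
import math

n1, n2 = 71, 24   # number of cases in each group (male, female)
n = n1 + n2

# Expected number of runs under the null hypothesis.
mean_runs = 2 * n1 * n2 / n + 1

# Standard deviation of the number of runs.
sd_runs = math.sqrt(2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1)))

print(f"Mean: {mean_runs:.3f}  Standard Deviation: {sd_runs:.3f}")
```

Z is then the observed number of runs minus this mean, divided by the standard deviation.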


Kruskal-Wallis 


The Kruskal-Wallis test is a one-way analysis of variance by ranks.
It tests whether 3 or more independent groups within a Y column are
from different populations. The groups in the Y column are specified
by an X column which must be either a Category or Integer column.


It is a OneXOneY statistic; however, only one X variable may be
assigned. If there are multiple Y variables assigned, a result is
calculated for each Y variable.


This example uses Heart History as a grouping (X) column and the
Cholesterol values.


*> Assign an X variable to Heart History. Assign a Y variable to
Cholesterol.


*> Choose Nonparametrics from the Compare menu. 
*> Click Kruskal-Wallis.

*> Click OK.

*> Choose Table from the View menu.


Kruskal-Wallis X1: Heart History Y1: Cholesterol


# Groups                4
H corrected for ties    2.088   p = .5544
# tied groups





The first page contains summary information. The view title names
the X and the Y variables. The summary statistics table shows:

* degrees of freedom
* number of different groups
* total number of cases read
* Kruskal-Wallis H statistic and probability

If there are tied groups the table includes:

* Kruskal-Wallis H corrected for ties and probability
* number of tied groups


*> Click the down arrow in the scroll bar.


Kruskal-Wallis X1: Heart History Y1: Cholesterol


Group:   # Cases:   Σ Rank:   Mean Rank:


The following pages show summary information for each group with
a maximum of five groups per page. The information for each group
includes:

* group name
* number of cases in that group
* sum of the ranks
* mean rank


There are no graphic views available. 
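
The per-group rank sums and the H statistic are linked by the textbook formula H = 12/(N(N+1)) * sum(R^2/n) - 3(N+1), with degrees of freedom equal to the number of groups minus one. The Python sketch below computes both for a small made-up dataset. It assumes no tied values, so no tie correction is applied, and the function name is illustrative.

```python
def kruskal_wallis(groups):
    """Return (H, per-group summary) for groups of untied values.

    Each summary entry is (number of cases, sum of ranks, mean rank).
    """
    pooled = sorted(value for group in groups for value in group)
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    n_total = len(pooled)
    summary = []
    weighted = 0.0
    for group in groups:
        rank_sum = sum(rank_of[value] for value in group)
        summary.append((len(group), rank_sum, rank_sum / len(group)))
        weighted += rank_sum ** 2 / len(group)
    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    return h, summary


# Hypothetical values for three groups.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h, summary = kruskal_wallis(groups)
print(round(h, 3), summary)
```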


Friedman Test 

The Friedman test is a two-way analysis of variance by ranks for
matched samples. It tests whether 3 or more matched samples,
designated by X variables, are from the same population. The
number of X variables that can be assigned is limited to 1,150.


It is a ManyX statistic. One result is calculated using all assigned X
variables.


This example tests four measurements: Weight, Cholesterol,
Triglycerides, and HDL.


*> Assign X variables to Weight, Cholesterol, Triglycerides,
and HDL.


*> Choose Nonparametrics from the Compare menu. 
*> Click Friedman. 

*> Click OK. 

*> Choose Table from the View menu.


Friedman 4 X variables


DF                       3
Chi-Squared              228.632   p = .0001
Chi corrected for ties   229.114   p = .0001
# tied groups            2


The first page contains summary information. The view title
specifies the number of X columns. The summary statistics table
shows:

* the degrees of freedom
* the number of samples
* the total number of cases read
* the Friedman Chi-Squared and probability

If there are tied groups the table includes:

* the Chi-Squared corrected for ties and probability
* the # of tied groups

*> Click the down arrow in the scroll bar.


Friedman 4 X variables


Name:   Σ Rank:   Mean Rank:


The following pages show summary information for each sample
with a maximum of five samples per page. The summary
information for each sample includes:

* sample name
* sum of the ranks
* mean rank
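
The rank sums on this page feed directly into the Friedman Chi-Squared. The Python sketch below ranks the values within each case (row), sums the ranks for each sample (column), and applies the textbook formula Chi-Squared = 12/(n*k*(k+1)) * sum(R^2) - 3*n*(k+1), where n is the number of cases and k the number of samples. It assumes no ties within a row, and the data are made up for the example.

```python
def friedman(rows):
    """Return (chi-squared, rank sum per column) for untied rows."""
    n = len(rows)       # number of cases
    k = len(rows[0])    # number of matched samples (X variables)
    rank_sums = [0] * k
    for row in rows:
        # Rank the k values within this case from 1 (smallest) to k.
        order = sorted(range(k), key=lambda j: row[j])
        for rank, column in enumerate(order, start=1):
            rank_sums[column] += rank
    total = sum(r * r for r in rank_sums)
    chi_squared = 12 / (n * k * (k + 1)) * total - 3 * n * (k + 1)
    return chi_squared, rank_sums


# Hypothetical data: four cases measured on three matched samples.
rows = [[1, 2, 3], [1, 3, 2], [1, 2, 3], [2, 1, 3]]
chi_squared, rank_sums = friedman(rows)
print(chi_squared, rank_sums)
```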


There are no graphic views available.


The Tools Menu


Chapter 6 — Specialized Data Handling 


This chapter focuses on data handling techniques. These include
such data massage functions as transformations, new column
creation by formulae, and recoding. All of these functions are
performed using the Tools menu.


This section details the data handling selections on the Tools menu
in the order they appear. The Preferences, New Column, Format,
Edit Categories, Select Range, Edit Range, and Clear Range
choices have been previously examined and are not discussed here.


Formula 


The Formula choice on the Tools menu allows the creation of new
columns using formula transformations of existing columns in the
dataset. The Formula choice provides a simple, straightforward
method for adding, multiplying, or dividing two columns or adding,
multiplying, or dividing a column by a constant. The created
columns are appended to the right side of the dataset.


*> Choose Formula from the Tools menu, and this dialog box 
appears: 


[Formula dialog box: a Name text entry rectangle at the top; a
scrolling list of transformations (None, 1/x, Sqrt, ...) centered
below it; Operand 1 and Operand 2 scrolling lists of columns
(Cholesterol, Triglycerides, HDL, ...); operator radio buttons between
the Operand lists, including none, mean, and sum; and Decimal Places
radio buttons 0 through 9.]


The text entry rectangle at the top of the box labelled Name contains
the name of the column being created. The default name entered here
is the next highest column number from the last column added to the
set.


There are three scrolling lists in the window: one centered below the 
Name rectangle; one to the left labelled Operand 1; and one to the 
right labelled Operand 2. The centered list contains the 
transformations that are available to be applied to the columns in the 
Operand lists. The two Operand lists each contain every column in 
the active dataset (other than String columns). These three lists 
combine with the mathematical operators between the two Operand 
lists to define the formula that creates the new column. 


The buttons labelled of Op1 and of Op2 enter the transformation
(highlighted in the centered list) of the column (highlighted in the list
below it) into the formula.


The 28 transformations available are: 


• None
• 1/x
• Sqrt
• |x|
• x^2
• x^n
• ln(x)
• ln(1+x)
• log(x)
• log(1+x)
• log2(x)
• e^x
• 10^x
• 2^x
• sin(x)
• cos(x)
• tan(x)
• arcsin(x)
• arccos(x)
• arctan(x)
• sec(x)
• csc(x)
• cot(x)
• sinh(x)
• cosh(x)
• tanh(x)
• arcsinh(x)
• arccosh(x)
• arctanh(x)


Note: n is specified in a dialog box that appears when the selection
x^n is transformed.


All trigonometric functions require measurements in radians. 
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*> Click on sin(x) in the transformation list. (Scroll through the list 
to find it.) 


*> Click on Weight in the Operand 1 list. 
*> Click of Op1.


The name of the list on the left changes from Operand 1 to sin(x) of.
Assignments of transformations work the same way for both the 
Operand lists. 


At the top of the transformation list is the choice None. This means
that no transformation is applied to the column. None allows
columns to be combined using the mathematical operators alone or
in conjunction with a constant.


The mathematical operators between the two Operand lists define the 
relationship between the two transformed (or not transformed) 
Operand columns. When the radio button operator none is selected, 
the Operand 2 list becomes inactive (gray), and the new column 
formula includes only the column transformation in the Operand 1 
list. When the radio button operators mean or sum are selected, the 
Operand 1 and the transformations lists become inactive (gray), and 
the new column formula includes the operator and the columns 
selected in the Operand 2 list. 


When any of the other four radio button operators (+, -, *, ÷) is
selected, transformations are selectable for both Operand lists.
A user-specified constant may be substituted for Operand 2.


*> Click on the radio button operator for multiplication (*). 


*> Click in the text entry rectangle labelled k (below the Operand 2
list).


The Operand 2 list becomes inactive (gray) when the constant 
specification rectangle is activated, and the constant rectangle 
becomes gray when the Operand 2 window is activated. 


Radio buttons along the bottom of the dialog box allow selection of 
the number of decimal places to be displayed in the new column. 
Any selection made here can be altered later using the Format 
selection on the Tools menu. 


The Calculate button creates a new column with the name specified 
in the Name rectangle using the formula and transformations 
determined by the selections in the dialog box. The Formula window 
does not close when Calculate is clicked. The new column is added
to both the scrolling Operand lists, a new default name appears in the 
column name rectangle, and another column can be defined. 


The Exit button closes the box. Clicking Exit does not define a new 
column regardless of what is set up in the box. 
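
The way the dialog combines a transformation, an operator, and a second operand or constant can be sketched in Python. This is only an illustration under stated assumptions, not StatView II's own code: the names `TRANSFORMS`, `OPERATORS`, and `formula_column` are hypothetical, `None` stands in for a missing value, and only a handful of the 28 transformations are included.

```python
import math

# A few of the dialog's transformations (hypothetical lookup table).
TRANSFORMS = {
    "None": lambda x: x,
    "1/x": lambda x: 1.0 / x,
    "Sqrt": math.sqrt,
    "ln(x)": math.log,
    "sin(x)": math.sin,  # argument in radians
}

# The four arithmetic operators between the Operand lists.
OPERATORS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def formula_column(op1, t1="None", operator="none", op2=None, t2="None", k=None):
    """Build a new column from one or two transformed operands.

    op1, op2: lists of numbers; None marks a missing value (assumption).
    operator: "none", "+", "-", "*", or "/".
    k: optional constant used in place of Operand 2.
    """
    f1, f2 = TRANSFORMS[t1], TRANSFORMS[t2]
    result = []
    for i, a in enumerate(op1):
        if a is None:
            result.append(None)
            continue
        left = f1(a)
        if operator == "none":
            # Only the Operand 1 transformation is used.
            result.append(left)
            continue
        right = k if k is not None else op2[i]
        if right is None:
            result.append(None)
            continue
        if k is None:
            right = f2(right)  # a constant is used as entered
        result.append(OPERATORS[operator](left, right))
    return result
```

For example, `formula_column(weight, operator="*", k=2)` doubles a column, and `formula_column(weight, t1="sin(x)")` corresponds to choosing sin(x) of Operand 1 with the operator none.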
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Transform 


The Transform choice on the Tools menu allows the creation of
new columns from a transformation of an existing column in the
dataset. The created columns are appended to the right side of the
dataset.


*> Choose Transform from the Tools menu, and this dialog box
appears:


[Dialog box: Transform. It shows the Name rectangle, the Select Column list, the Transformation list, the Decimal Places radio buttons (0 through 9), and the Transform button.]





The text entry rectangle labelled Name at the top of the box contains 
the name of the column being created. This name reflects the 
selections in the box's two scrolling lists: Select Column and 
Transformation. 


The Select Column list contains all the columns in the active dataset
(except String columns). The Transformation list contains the
following 35 transformation functions:


• 1/x
• Sqrt
• |x|
• x^2
• x^n
• ln(x)
• ln(1+x)
• log(x)
• log(1+x)
• log2(x)
• e^x
• 10^x
• 2^x
• sin(x)
• cos(x)
• tan(x)
• arcsin(x)
• arccos(x)
• arctan(x)
• sec(x)
• csc(x)
• cot(x)
• sinh(x)
• cosh(x)
• tanh(x)
• arcsinh(x)
• arccosh(x)
• arctanh(x)
• Rank
• Standard Score
• Running Sum
• Difference
• Percentages
• Lag
• Moving Averages


Note: n is specified in a dialog box that appears when the
selection x^n is transformed.


All trigonometric functions require measurements in radians. 


The following provides a description of the final seven 
transformations: 


Rank is the rank of the value in the column. 


Standard score is (value - mean) divided by the standard
deviation.


In running sum, each value is replaced by a summation of the 
values through that point. 


The difference is the difference between the current value 
and the value in the preceding row. 


In percentages, each value is divided by the sum of the 
column, and then multiplied by 100. 


In lag, columns are lagged by a specified number of rows (a 
dialog box appears after this choice asking for the lag). 


Moving average computes a simple moving average for the 
specified period (a dialog box appears after this choice asking 
for the period). 
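
The seven descriptions above can be sketched in Python. This is a hedged illustration rather than StatView II's implementation: the function names are hypothetical, tied values are given their average rank, the standard deviation uses an n - 1 denominator, and rows with no defined result (the first row of a difference, the first rows of a lag or moving average) are filled with `None`. Each of those choices is an assumption.

```python
import statistics


def rank(values):
    """Rank from 1 (smallest); ties share the average rank (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for p in range(i, j + 1):
            ranks[order[p]] = avg
        i = j + 1
    return ranks


def standard_score(values):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # n - 1 denominator (assumption)
    return [(v - mean) / sd for v in values]


def running_sum(values):
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out


def difference(values):
    # The first row has no preceding row, so it is left missing.
    return [None] + [values[i] - values[i - 1] for i in range(1, len(values))]


def percentages(values):
    total = sum(values)
    return [v / total * 100 for v in values]


def lag(values, n):
    # Shift down by n rows; the first n rows become missing.
    n = min(n, len(values))
    return [None] * n + values[: len(values) - n]


def moving_average(values, period):
    # Trailing window; rows without a full window are missing.
    out = [None] * min(period - 1, len(values))
    for i in range(period - 1, len(values)):
        out.append(sum(values[i - period + 1 : i + 1]) / period)
    return out
```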


Radio buttons along the bottom of the dialog box allow selection of 
the number of decimal places to be displayed in the new column. 
Any selection made here can be altered later using the Format 
selection on the Tools menu. 
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The Transform button creates a new column with the name 
specified in the Name rectangle using the transformation and column 
selected. The Transform window does not close when Transform is 
clicked. The new column is added to the scrolling Select Column 
list, and, although the name specified for the new column does not 
change (since it still describes the set-up in the box), another column 
can be defined. 


The Exit button closes the box. Clicking Exit does not define a new
column regardless of what is set up in the box.


Recode 


The Recode selection in the Tools menu creates a new column that
is a recoded version of a column in the active dataset. The new
column is added to the right side of the dataset.


*> Choose Recode from the Tools menu, and the following dialog 
box appears: 


[Dialog box: Recode. It shows the Name rectangle (Recode of Gender), the Select column list, and three radio buttons under Choose desired recoding: Continuous data to Category data, Missing Values to specified value, and Range of Values to specified value.]





The text entry rectangle labelled Name determines the name of the
column being created. The default name that appears in this
rectangle is Recode of (column selected).


The scrolling list labelled Select column contains each column in the 
dataset (except String columns). The column selected here is the 
column that is recoded to form the new column. 


Three types of recoding are available through radio button selection
under the heading Choose desired recoding: Continuous data to
Category data; Missing Values to specified value; and Range of
Values to specified value.


Continuous Data to Category Data 
If you do not understand how to create and use category sets, please
review the section in Chapter 2 on Category data (in the section on
creating datasets) before using this recode option.


Continuous data is any data of types real, integer, or long integer. 
This recode option allows numeric data to be reclassified into 
category data. The category set that the data is coded to can be one 
that exists already (in either the dataset's library of category sets or in
the general StatView II Library of category sets) or it can be a new 
category set. 

*> Choose Recode from the Tools menu.

*> Highlight Weight in the Select column list. 

*> Click Continuous data to Category data. 


*> Click Recode, and this dialog box appears: 


[Dialog box: "Please choose the new column's category." It shows a scrolling list of category sets (for example Smoking and Heart Attack) with the File, Select, and New buttons.]





This dialog box is the same one used to select a category set for a 
new category column being defined using either New in the File 
menu or New Column in the Tools menu. The File button 
allows selection of a category set from either the file's own 
library of sets; from the library of sets in any other open dataset; 
or from the StatView II Library. The button Select assigns the 
highlighted choice in the scrolling list as the category set for the 
new column. The button New allows definition of a new 
category set for the column using the same dialog box used to 
define category columns when creating a dataset or adding 
columns to an existing one. 


*> Click New, and this dialog box appears: 


[Dialog box: Create a New Category. It shows the Category Name and Element Name rectangles, the Add, Replace, and Delete buttons, and the Done button.]





*> Enter Weight Recode for Category set name.







*> Enter Light for element one name, and click Add.
*> Enter Medium for element two name, and click Add.
*> Enter Heavy for element three name, and click Add.


*> Click Done, and this dialog box appears:


[Dialog box: "Enter a range and select a category element:" It shows the Lower Bound and Upper Bound rectangles with their comparison buttons, and the Recode Value list of category elements.]


The text entry rectangle labelled Lower Bound has the lowest 
value in the column being recoded in it, and the rectangle 
labelled Upper Bound has the highest value in that column. The 
scrolling list to the right contains the elements of the category set 
into which the continuous data is being recoded. The first 
element is highlighted. 


This example demonstrates that this recode dialog box can be 
used with minimal mouse activity if data is recoded from lowest 
to highest to category elements beginning with the first and 
proceeding to the last. 


*> Click the second Less Than button so that the equation reads
107 is less than or equal to x and x is less than 234.


*> Press Tab and the Upper Bound box becomes highlighted.

*> Type 135, and press Enter (this is the same as clicking More).


The values in the Weight column that are greater than or equal to 
107 and less than 135 will be Light in the new column. 


Notice that Medium is now highlighted (ready to be defined) in 
the scrolling list. The next set element is highlighted whenever 
More is clicked (or Enter or Return is typed). Also, the Lower 
Bound box now contains the number 135 (the top of the range 
previously defined) and the upper bound box contains the value 
234 (the highest in the column). 


*> Press Tab and the Upper Bound box becomes highlighted. 


*> Type 180 and press Enter. 
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All values in the Weight column from 135 to 179 will be 
Medium in the new column. The same events occur with the 
definition of the second bounds. Heavy is now highlighted, and 
180 is entered as the lowest bound. 


*> Click the second Less Than or Equal to button so that the
equation reads 180 is less than or equal to x and x is
less than or equal to 234.


*> Click Done and the box closes. 


All values in the Weight column from 180 to 234 will be Heavy 
in the new column. 


*> Click Exit from the Recode box. Scroll to and examine the new 
category column titled Recode of Weight. 


It is possible to recode a Category column to a Category column 
using this Recode option. The values that appear in the Lower and 
Upper bounds boxes are the ordinal numbers associated with the 
elements in the Category column being recoded. 
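
The Weight example above amounts to three bounded rules, which can be sketched in Python. The function `recode_to_category` and the rule format are hypothetical names used only for illustration, and leaving unmatched or missing values as `None` is an assumption about behaviour the manual does not spell out.

```python
def recode_to_category(values, rules):
    """Recode numbers into category elements.

    rules: list of (lower, upper, upper_inclusive, element). A rule reads
    "lower <= x < upper" or, when upper_inclusive is True,
    "lower <= x <= upper". Values matching no rule, and missing values
    (None), stay missing (assumption).
    """
    out = []
    for x in values:
        label = None
        if x is not None:
            for lower, upper, upper_inclusive, element in rules:
                below = x <= upper if upper_inclusive else x < upper
                if lower <= x and below:
                    label = element
                    break
        out.append(label)
    return out


# The three ranges defined in the Weight walkthrough.
weight_rules = [
    (107, 135, False, "Light"),
    (135, 180, False, "Medium"),
    (180, 234, True, "Heavy"),
]
```

Because each upper bound becomes the next lower bound, every value between 107 and 234 falls into exactly one element.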


Missing Values to Specified Value 


This option allows the recoding of missing values in continuous data 
to either a specified value; the mean of the column being recoded; 
the geometric mean of the column; or the harmonic mean of the 
column. The option allows the recoding of missing values in 
Category data columns to a specified element of the column's 
category set. 


Recoding for both data types can be performed on all missing values
in the column or just those missing values that occur in included
rows. (Including and excluding rows is discussed later in this
chapter.) This affects the calculation of the means (if that calculation
is selected for recoding) as well.

*> Activate Demonstration Data and enter a missing value in any
cell of the Weight column and any cell of the Gender column.
(Enter missing values by typing a period in the cell.)

*> Choose Recode from the Tools menu.

*> Choose Missing values to specified value.

*> Choose Weight.


*> Click Recode, and this appears: 
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Recode missing values to: 


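[Dialog box options: a text entry rectangle for a specified value; radio buttons for Mean, Geometric Mean, and Harmonic Mean; and a check box labelled Recode INCLUDED rows only.]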





The selections in this box determine the value to which all 
missing values in the column are recoded. If the check box is 
selected, only missing values in included rows are recoded, and 
the means for recoding are recalculated from included rows only. 


*> Enter 150 in the text entry rectangle, make sure its radio
button is selected, and press Enter. Click Exit in the next box.


*> Examine the cell in Weight in which you entered a missing value
symbol.


*> Choose Recode from the Tools menu.
*> Choose Missing values to specified value.
*> Choose Gender.


*> Click Recode, and this appears:


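[Dialog box: "Recode missing values to:" It shows a scrolling list of the Gender category elements (for example female) and a check box labelled Recode INCLUDED rows only.]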





The elements in the category set that defines the Gender column 
appear in a scrolling list. Select the element to which you wish to 
recode. Again, a check box allows recoding of included rows 
only. Clicking Done recodes the specified missing values in the 
Gender column. 
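
The choices for continuous data can be sketched in Python as follows. This is an illustrative sketch, not the program's code: `recode_missing` and its parameters are hypothetical names, `None` marks a missing value, and the geometric and harmonic means assume positive data.

```python
import math


def recode_missing(values, method="value", value=None, included=None,
                   included_only=False):
    """Replace None entries with a constant or with one of three means.

    included: optional list of booleans, one per row (True = included).
    included_only: mirrors the "Recode INCLUDED rows only" check box; when
    set, the mean is computed from included rows and only missing values
    in included rows are recoded.
    """
    if included is None:
        included = [True] * len(values)
    # Rows that feed the mean: all rows, or only the included ones.
    pool = [v for v, inc in zip(values, included)
            if v is not None and (inc or not included_only)]
    if method == "value":
        fill = value
    elif method == "mean":
        fill = sum(pool) / len(pool)
    elif method == "geometric":
        fill = math.exp(sum(math.log(v) for v in pool) / len(pool))
    elif method == "harmonic":
        fill = len(pool) / sum(1.0 / v for v in pool)
    else:
        raise ValueError("unknown method: " + method)
    return [fill if v is None and (inc or not included_only) else v
            for v, inc in zip(values, included)]
```

Calling `recode_missing(weights, value=150)` corresponds to typing 150 in the text entry rectangle; `method="mean"` corresponds to the Mean radio button.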
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Range of Values to Specified Value. 

This recoding option works much like the first option (recoding
continuous data to category data). A set of bounds is presented
representing the lowest and the highest data points in the selected
column. The recode-to value can be a user-specified number,
although convenient default values for it begin at one and increase
by one for each set of bounds defined.

*> Choose Recode from the Tools menu. 

*> Choose Weight from the Select column scrolling list.


*> Choose the third recode option: Range of Values to specified
value.
*> Click Recode, and this appears:


[Dialog box: "Enter a range and a value to recode to:" It shows the Lower Bound, Upper Bound, and Recode to rectangles.]


This box works exactly as the box in the first recode option. Like 
the first option, this option can be used to recode Category data. 
The ordinal numbers assigned the elements of the category data 
column being recoded are entered in the bounds rectangles, and 
the column is recoded to an integer column. 





This option is useful for recoding outlying data to missing values, or 
for collapsing categories based on the ordinal numbers assigned to 
the category elements. 
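
Both uses mentioned above (turning outliers into missing values, and collapsing categories by ordinal number) can be sketched with one small Python function. The name `recode_ranges`, the inclusive bounds, and the `keep_unmatched` switch are assumptions made for this sketch; the manual does not say what happens to values that fall outside every range.

```python
def recode_ranges(values, rules, keep_unmatched=False):
    """Recode each value that falls inside a range to that range's value.

    rules: list of (lower, upper, new_value). Both bounds are treated as
    inclusive here; in the dialog the comparison is selectable.
    keep_unmatched: if True, values outside every range are copied
    unchanged; if False they become missing (None). Both behaviours are
    assumptions.
    """
    out = []
    for x in values:
        new = x if keep_unmatched else None
        if x is not None:
            for lower, upper, new_value in rules:
                if lower <= x <= upper:
                    new = new_value
                    break
        out.append(new)
    return out
```

For instance, `recode_ranges(ordinals, [(1, 2, 1), (3, 4, 2)])` collapses four category elements into two, and `recode_ranges(weights, [(500, 1000, None)], keep_unmatched=True)` sets implausibly large weights to missing.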


Sort 


The Sort choice in the Tools menu allows you to sort the entire
active dataset using any of its columns as the sort key. Sorting
cannot be undone, although creating a series column before the
dataset is sorted provides a key to re-sort the data to its original
order. (Series, the next Tools menu choice, is discussed immediately
below.)


*> Choose Sort from the Tools menu, and this appears: 
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Select Column: 


[Dialog box: Sort. It shows the Select Column list (Gender, Age, Weight, Cholesterol, Triglycerides, HDL) and the ascending and descending radio buttons.]


Radio buttons allow either ascending or descending sorts of the 
dataset. The column selected in the Select Column scrolling window 
is the key column; its cells are sorted in either ascending or 
descending order, and all rows in the set are rearranged in the same 
order. 





Sorting on a category column causes the elements in that column to 
be grouped according to their ordinal values. 


Series 


The Series menu choice on the Tools menu allows the creation of 
new columns that are either a uniform random series; a normalized 
random series; or a time series with a user specified starting value 
and step increment. Displayed decimal places for the new column, 
and the number of values (cells) in it are specified as well. The 
column is added to the right side of the active dataset. 


*> Choose Series from the Tools menu, and this appears: 


[Dialog box: "Create a new column using specified values:" It shows the Number of values to be created rectangle; the Column contains radio buttons (Uniform Random, Normalized Random, Time Series) with the Start and Step rectangles; and the Decimal Places radio buttons (0 through 9).]


The text entry rectangle labelled Name contains the default name 
Column (#). This name is assigned to the created column, and is 
user specifiable. 
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Below Name, a text entry rectangle labelled Number of values to be 
created allows specification of the number of values to be generated 
in the column being created. The default value in this rectangle is the 
number of rows in the active dataset. If a value less than the number 
of rows in the dataset is specified, the remaining cells are filled with 
missing value symbols. If a value greater than the number of rows in 
the dataset is specified, the rows added to the other columns in the
dataset are filled with missing values. 


Radio buttons under the heading Column contains allow three 
selections for the new column: Uniform Random; Normalized
Random; and Time Series. Uniform Random generates a column of 
random numbers greater than zero but less than one. Normalized 
Random generates a column of random numbers with a standard 
deviation that approximates one and a mean that approximates zero 
(fits the normal curve). 


Time Series creates a column beginning at the value specified in the 
Start text entry rectangle and incrementing by the value specified in 
the Step text entry rectangle. 


Time series columns can be used to reorder sorted columns. A time
series column beginning at one and incrementing by one remembers 
the order of a dataset. If the dataset is sorted, it can be returned to its 
original order by sorting again on the time series column. 
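
The three series types, and the trick of restoring a dataset's original order with a time series column, can be sketched in Python. The function names and the dict-of-lists dataset are hypothetical and used only for illustration. Note that Python's `random()` can in principle return exactly 0, whereas the manual describes values strictly greater than zero.

```python
import random


def uniform_random(n, rng=random):
    # random() returns values in [0, 1); the manual describes (0, 1).
    return [rng.random() for _ in range(n)]


def normalized_random(n, rng=random):
    # Draws from a normal distribution with mean 0 and standard deviation 1.
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def time_series(n, start, step):
    # Begins at start and increments by step.
    return [start + i * step for i in range(n)]


def sort_rows(dataset, key, descending=False):
    """Sort every column of a dict-of-lists dataset by one key column."""
    order = sorted(range(len(dataset[key])),
                   key=lambda i: dataset[key][i], reverse=descending)
    return {name: [col[i] for i in order] for name, col in dataset.items()}


# Remember the row order, sort by Weight, then restore the order.
data = {"Weight": [180, 120, 150], "Age": [40, 25, 33]}
data["Order"] = time_series(3, 1, 1)
by_weight = sort_rows(data, "Weight")
restored = sort_rows(by_weight, "Order")
```

After the second sort, `restored` holds the rows in their original order.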


Radio buttons along the bottom of the Series dialog box allow
specification of the number of decimal places to be displayed in the 
created column. Any value chosen here can be changed later using 
the Format selection in the Tools menu. 


Splitting Columns 


You may split variables by category to easily break out subsets of 
your data. This allows you to generate summary statistics on the
individual groups of your dataset. Any category column can be used 
as a key to split other variables. The columns created by splitting are 
placed in a new dataset and assume the data type of the columns that 
produced them. 


For example: 
*> Activate Demonstration Data if it is not already active.


*> Choose Split Columns from the Tools menu, and the following
dialog box appears:
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Create New DataSet by Splitting Columns 


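[Dialog box: Split Columns. The Select Split Key list shows Smoking History, Alcohol use, and Heart History; the Select Column(s) to Split list shows Gender, Age, Weight, Cholesterol, Triglycerides, HDL, LDL, and ideal body wt.; the buttons are Done and Cancel.]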


The text entry rectangle titled Name lets you determine the name 
of the new dataset being created with the results of the splitting. 


The scrolling selection list on the left labelled Select Split Key 
contains all the category columns in Demonstration Data. Only 
one of these may be selected as the key to split data columns. 


The scrolling list to the right labelled Select Column(s) to Split 
contains all the string, integer, real, and long integer columns in 
the dataset. Any number of these columns may be selected to be
split by the category column key. 


*> Select Gender in the Split Key list.


*> Select Age, Weight, and Cholesterol from the Column(s) to
Split list.


*> Click Done, and a dataset appears on the desktop. Zoom it to full
size.
[Dataset window: the new dataset, with columns such as female - Age and male - Weight.]


Notice that the Age column has been split into two columns: one 
containing the ages of the males and the other containing the 
ages of the females. The same applies for the Weight column and 
the Cholesterol column. You can now assign X variables to
these columns and generate descriptive statistics for each of these
sub-groups. 
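
Splitting by a category key can be sketched in Python as below. The function `split_columns` and the dict-of-lists dataset are hypothetical; the "element - column" naming follows the column titles visible in the example, and the ordering of elements by first appearance is an assumption.

```python
def split_columns(dataset, key, columns):
    """Return a new dataset with one column per (key element, column) pair."""
    elements = []
    for e in dataset[key]:
        if e is not None and e not in elements:
            elements.append(e)  # order of first appearance (assumption)
    new = {}
    for col in columns:
        for e in elements:
            # Keep only the rows whose key cell matches this element.
            new[f"{e} - {col}"] = [
                v for k, v in zip(dataset[key], dataset[col]) if k == e
            ]
    return new
```

The split columns generally have different lengths, one per group, which is why they are placed in a new dataset rather than appended to the original one.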
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Dataset Size 
Considerations 


Appendix A — StatView II Memory Limits 


Dataset size in StatView II is limited by RAM. The more RAM 
available, the larger the file size can be. 


StatView II allocates memory in a dynamic manner allowing 
datasets with fewer columns to have more rows than datasets with a 
larger number of columns. Additionally, columns of different types 
(real, integer, long integer, category, and string) require different 
amounts of memory. These factors, while they add to the utility of 
the program, make it difficult to provide hard and fast dataset size 
limits. The following information, however, can help you plan the 
size of datasets. 


When a column is created, StatView II allocates a fixed number of 
rows for that column whether or not the rows actually exist (have 
data entered in them). For category columns 1524 rows are allocated; 
for integer columns 762 rows are allocated; for long integer columns 
381 rows are allocated; for string columns 381 rows are allocated; 
for real columns 127 rows are allocated. When rows exceeding the 
original allocation are entered, StatView II allocates another block as
large as the first. 


Therefore, 127 rows of real data take no more memory than 1 row,
but 128 rows of real data take twice as much memory as 127 rows.
254 (2 * 127) rows of real data take no more memory than 128 rows,
but 255 (2 * 127 + 1) rows of real data take three times as much as
127 rows of real data. 382 (3 * 127 + 1) rows of real data take four
times as much as 127 rows; 509 (4 * 127 + 1) rows of real data take
five times as much as 127 rows, and so on.
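
The block arithmetic above is a ceiling division, which can be checked with a short Python sketch. The table `ROWS_PER_BLOCK` simply restates the allocation sizes given in the text; the function name `blocks_needed` is hypothetical.

```python
# Rows allocated per block for each column type, as listed above.
ROWS_PER_BLOCK = {
    "category": 1524,
    "integer": 762,
    "long integer": 381,
    "string": 381,
    "real": 127,
}


def blocks_needed(column_type, rows):
    """Number of equal-sized blocks a column of the given length occupies."""
    per_block = ROWS_PER_BLOCK[column_type]
    # Ceiling division; a column always holds at least one block.
    return max(1, -(-rows // per_block))
```

A real column of 128 rows therefore occupies two blocks, the same as one of 254 rows, while 255 rows need a third block.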


The following steps can free RAM memory: 
• Do not use MultiFinder, or if you must use MultiFinder
allocate more memory for StatView II by using the Get Info
dialog.


• Close desk accessories.


• Close all StatView II datasets other than the one in use.
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Memory Alert 


• Clear the Clipboard by selecting it from the Windows menu
and choosing Delete Clipboard from the File menu.


• Turn RAM cache off.
• Avoid pasting large color pictures into the view window.


The absolute maximum size a StatView II dataset can be is 32,765 
rows by 8,192 columns. Currently, there is no Macintosh with 
enough RAM to create a set this big. 


A good indication that StatView II dataset size is crowding memory 
is sluggish performance. If editing and statistical operations take
unusually long to perform, the program is running out of memory. 


Two dialog boxes warn that you are approaching memory limits.
The first box states "Sorry, but StatView II is running low on
memory. Please close any unnecessary files and desk accessories." If
memory problems persist, a second box appears stating "StatView II
is running dangerously low on memory. It is imperative that you
close any unnecessary files and desk accessories."


If StatView II does run out of memory, you are still able to save data 
files open on the desktop. If there are no files open that have unsaved 
changes, StatView II displays a bomb box that states, “Sorry, but 
there isn't enough memory to continue. Since you didn't make any 
changes to your files, clicking OK will end StatView II.” At this 
juncture, you have no choice but to click OK and return to the 
desktop. 


If you have made changes to your files, StatView II displays this
dialog box: 





[Dialog box: "Sorry, but there is not enough memory to continue. You may save your files before returning to the Finder; just check the box next to the files whose changes you wish to keep." Eight check boxes follow, one per file slot: changed files are listed by name (here untitled-2), unchanged files are marked Not Changed (here Questionnaire and Demonstration Data), and empty slots are marked as not used.]


Since there can be eight StatView II data files open at any one time, 
there are eight save-file check boxes in the box. If a file is open, and
changes have been made to it, StatView II displays the file's name
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next to a check box. Check the box to save the file. After you click 
OK, the standard Macintosh name-save file dialog box appears. 


If a file was open but no changes had been made to it, its name is
displayed in gray next to a check box with the notation "not
changed" appended to it.
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Descriptive Statistics 


Appendix B — Formulae and References 


n = number of non-missing, non-excluded values


count = n

Mean = $\frac{\sum x}{n}$ (referred to below as $\bar{x}$ or $\bar{y}$)

Variance ($\sigma^2$) = $\frac{\sum (x - \bar{x})^2}{n - 1}$


Standard Deviation ($\sigma$) = $\sqrt{\sigma^2}$


Standard Error ($\sigma_{\bar{x}}$) = $\frac{\sigma}{\sqrt{n}}$


Coefficient of Variation = $\frac{\sigma}{\bar{x}} \times 100$


Minimum = smallest value among X
Maximum = largest value among X
Range = Maximum - Minimum

Sum = $\sum X$

Sum Squared = ($\sum X^2$)

# missing = count of the missing values
Confidence intervals: t distribution


error = $t_{(\alpha,\, n-1)} \times \sigma_{\bar{x}}$ where $\alpha$ = 1 - selected %


Confidence intervals: normal distribution


error = $\frac{x \times \text{user std. dev.}}{\sqrt{n}}$

x = value such that the probability of normal deviation is less
than 1 - selected %


lower = $\bar{x}$ - error
upper = $\bar{x}$ + error


Percentiles: the pth percentile using linear interpolation is:

$$(1 - f)\,x_k + f\,x_{k+1}$$

where $x_k$ and $x_{k+1}$ are the kth and (k+1)th non-missing,
non-excluded values in the column, taken in ascending order.


k and f are determined from the value v shown below. k is the
integer part of v and f is the fractional part:

$$v = \frac{np}{100} + 0.5$$

where n is the count and p is the desired percentile.
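
The interpolation rule, with v = np/100 + 0.5, k the integer part of v, f its fractional part, and the result (1 - f) times the kth value plus f times the (k+1)th value, can be sketched in Python. The function name `percentile` is hypothetical, and clamping to the smallest or largest value when k falls outside 1..n-1 is an assumption, since the manual does not describe that case.

```python
import math


def percentile(values, p):
    """pth percentile by linear interpolation between sorted values."""
    xs = sorted(v for v in values if v is not None)
    n = len(xs)
    v = n * p / 100 + 0.5
    k = math.floor(v)  # integer part
    f = v - k          # fractional part
    if k < 1:
        return xs[0]   # below the first value: clamp (assumption)
    if k >= n:
        return xs[-1]  # at or beyond the last value: clamp (assumption)
    # xs is 0-indexed, so the kth value is xs[k - 1].
    return (1 - f) * xs[k - 1] + f * xs[k]
```

With the four values 10, 20, 30, 40 and p = 50, v is 2.5, so the result lies halfway between the second and third values.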


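Geometric Mean = $e^{\frac{\sum \ln x}{n}}$


Harmonic Mean = $\frac{n}{\sum \frac{1}{x}}$

Kurtosis = $\frac{m_4}{m_2^2} - 3$

Skewness = $\frac{m_3}{m_2^{3/2}}$

Where:

$m_2 = \frac{\sum (X - \bar{x})^2}{n}$

$m_3 = \frac{\sum (X - \bar{x})^3}{n}$

$m_4 = \frac{\sum (X - \bar{x})^4}{n}$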


Comparative Statistics

Comparative Percentiles


For the X and Y columns the following percentiles are computed:


1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 96, 97, 98, 99, 100

See Descriptive Statistics above for the computation of percentiles.
One Sample t-Test

N = number of x observations


$\mu$ = population mean, entered by user

$$t = \frac{\bar{x} - \mu}{\sqrt{\dfrac{\sum x^2 - \frac{(\sum x)^2}{N}}{N(N - 1)}}}$$

DF = N - 1

Paired t-Test


N = number of paired x,y observations

$D_i = x_i - y_i$

$$t = \frac{\frac{\sum D_i}{N}}{\sqrt{\dfrac{\sum D_i^2 - \frac{(\sum D_i)^2}{N}}{N(N - 1)}}}$$

DF = N - 1
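
The one sample and paired tests share the same computing form: the mean (of x minus the hypothesized mean, or of the paired differences) divided by the square root of the corrected sum of squares over N(N - 1). A minimal Python sketch of that computation follows; the function names are hypothetical and the sketch returns only t and the degrees of freedom, not a probability.

```python
import math


def one_sample_t(xs, mu):
    """Return (t, df) for a one sample t-test against the mean mu."""
    n = len(xs)
    mean = sum(xs) / n
    # Corrected sum of squares: sum(x^2) - (sum x)^2 / N
    ss = sum(x * x for x in xs) - sum(xs) ** 2 / n
    se = math.sqrt(ss / (n * (n - 1)))
    return (mean - mu) / se, n - 1


def paired_t(xs, ys):
    """Return (t, df) for a paired t-test on the differences x - y."""
    d = [x - y for x, y in zip(xs, ys)]
    # A paired test is a one sample test of the differences against 0.
    return one_sample_t(d, 0.0)
```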
Unpaired t-Test

$N_1$ = number of observations in group 1

$N_2$ = number of observations in group 2

$\bar{X}_1$ is the mean of the group 1 observations
$\bar{X}_2$ is the mean of the group 2 observations

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\left[\dfrac{\left(\sum X_1^2 - \frac{(\sum X_1)^2}{N_1}\right) + \left(\sum X_2^2 - \frac{(\sum X_2)^2}{N_2}\right)}{N_1 + N_2 - 2}\right]\left[\dfrac{N_1 + N_2}{N_1 N_2}\right]}}$$

where $X_1$ designates the first group and $X_2$ the second group.

DF = $N_1 + N_2 - 2$
Correlation Coefficient and All Regressions 


Covariances and correlations are computed in StatView II using
provisional means. See the section Computation Considerations in 
Chapter 2 for details on this procedure. 


StatView II applies the Sweep Operator to the XX' matrix of cross 
product deviations in order to calculate regression coefficients. 
Sweeping operations are discussed in Draper and Smith (1981), 
Hocking (1985) and Goodnight (1979). The sweeping operation is 
used to add and delete variables from the regression equation. Beta 
coefficients, partial correlations, multiple correlation, partial Fs and 
residual sum of squares are computed as each variable enters (or
leaves) the regression equation. 


The calculation of confidence bands for the mean and confidence 
intervals for the slope of a simple regression is discussed in Draper 
and Smith (1981) and Sokal and Rohlf (1981). 

One factor Factorial model 


For a single factor factorial model StatView II uses the procedures 
outlined by Winer (1971) in Chapter Three. 


The Model II estimate of between component variance is discussed
both by Winer and by Afifi and Azen (1979). The formula is as follows:

$$\frac{(\text{Mean Square}_{\text{between groups}} - \text{Mean Square}_{\text{within groups}})(I - 1)}{\sum_i J_i - \dfrac{\sum_i J_i^2}{\sum_i J_i}}, \quad i = 1 \ldots I$$

where $J_i$ is the count of non-missing, non-excluded values for the ith
group.


Multiple comparisons are discussed in Winer (1971) and Milliken 
and Johnson (1984). The formulas used are listed below: 


For all tests: 


k is the number of groups
MD is the mean difference between the group means
MS is the between groups mean square


t is the two-tailed t value at the user-entered significance level at the
within groups df

where $N_a$ is the count of group a and $N_b$ is the count of group b


Fisher's Protected Least Significant Difference (PLSD)

$$t \sqrt{r \times MS}$$

Scheffé F test

$$\frac{\;\dfrac{MD \times MD}{r \times MS}\;}{k - 1}$$

Dunnett's t

$$\frac{|MD|}{\sqrt{r \times MS}}$$


Two and more factor Factorial models 


For an N-factor Factorial model, StatView II starts out by building an
N-dimensional matrix of cell counts and sums. Note that any
observation that contains a missing value in any of the X columns or
the single Y column is dropped from the counting. Once this
partitioning has been accomplished, StatView II then builds all the
possible column totals and checks to see if the data is balanced. If
all cell frequencies are equal, then StatView II uses the algorithms
outlined in Winer (1971) in Chapter Six to compute the ANOVA
table. If the data is not balanced, StatView II resorts to the algorithm
described by Hocking (1985) on pg. 147. Missing cells are handled
as described by Hocking on pg. 163 by noting where the initial
sweep of the S matrix ended and sweeping on the non-swept
columns after sweeping out the columns associated with the
hypotheses of interest.


One factor Repeated Measures model 


For a single factor repeated measures model StatView II uses the 
procedures outlined by Winer (1971) in Chapter Four. 


For a single factor model reliability estimates are computed (Winer,
p. 283). Reliability estimates are given for the mean of all treatments
and for a single treatment. The formulae are provided below:
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MS etireen groups ~— MS yithin groups 


M Swithin groups 


~- 


K9 
mean of all treatments: Poa eek 





K6 
mean of single treatment: 1 4 43 


k = number of treatments ( X columns) 


Multiple comparisons are discussed above. The calculations are the 
same except that MSresidual is substituted for MSbetween groups. 


Two and more factor Repeated Measures model 


As with the N-factor Factorial model, StatView II starts out by
building an N-dimensional matrix of cell counts and sums. However,
the repeated measures ANOVA ignores any observation that contains
a missing value in any X or Y column. If the model is balanced,
StatView II then computes the ANOVA as described by Winer
(1971) in Chapter Seven. StatView II performs only the repeated
measures with one within factor; two or more within factors are not
supported. If the model is unbalanced but has only two factors and
contains no missing cells, StatView II uses the least squares
algorithm outlined in Winer (1971) on pg. 599. Three-factor and
higher unbalanced repeated measures ANOVAs are not supported by
StatView II.


Goodness of Fit Chi-Square 


$$\chi^2 = \sum \frac{(O - E)^2}{E}$$


where the X column contains the Observed Values and the Y column the 
Expected Values 
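
As a small illustration of the goodness of fit statistic, the sum of (O - E)^2 / E can be computed directly. The function name `chi_square_gof` and the list inputs are hypothetical; they stand in for the X (observed) and Y (expected) columns.

```python
def chi_square_gof(observed, expected):
    """Goodness of fit chi-square: sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Example: three categories, each expected to hold 20 observations.
stat = chi_square_gof([18, 22, 20], [20, 20, 20])
```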


Contingency Table Analysis 


N = number of observations 

r = # rows of contingency table, determined from the groups of the 
Y column 

c = # columns of contingency table, determined from the groups 
of the X column 

DF = (r-1)(c-1) 

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where: 

E = CR/N, the expected values 
C = column total 
R = row total 
O = observed value 
N = grand total 

$$G \text{ Statistic} = 2\left[\sum O \ln O - \sum R \ln R - \sum C \ln C + N \ln N\right]$$

$$\text{Contingency Coefficient} = \sqrt{\frac{\chi^2}{\chi^2 + N}}$$

$$\text{Phi} = \sqrt{\frac{\chi^2}{N}}$$

$$\text{Cramer's V} = \sqrt{\frac{\chi^2}{N(q-1)}}$$

where q = min(r,c). Note: when r=c=2, V is the same as Phi. 
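
The contingency table statistics can be sketched as follows. This is an illustrative sketch only: the function name `contingency_stats`, the list-of-rows input, and the convention that 0 ln 0 counts as 0 in the G statistic are assumptions made here, not statements about StatView's implementation.

```python
import math

def contingency_stats(table):
    """table: list of rows of observed counts (r rows by c columns)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)

    # Chi-square with expected values E = C * R / N
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = col_tot[j] * row_tot[i] / n
            chi2 += (o - e) ** 2 / e

    def xlnx(v):
        return v * math.log(v) if v > 0 else 0.0  # assumed: 0 ln 0 = 0

    g = 2 * (sum(xlnx(o) for row in table for o in row)
             - sum(xlnx(r) for r in row_tot)
             - sum(xlnx(c) for c in col_tot)
             + xlnx(n))

    q = min(len(row_tot), len(col_tot))
    return {
        "df": (len(row_tot) - 1) * (len(col_tot) - 1),
        "chi2": chi2,
        "g": g,
        "contingency_coefficient": math.sqrt(chi2 / (chi2 + n)),
        "phi": math.sqrt(chi2 / n),
        "cramers_v": math.sqrt(chi2 / (n * (q - 1))),
    }
```

For a 2 x 2 table, q - 1 = 1, so the `phi` and `cramers_v` entries coincide, as the note in the text says.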


Chi-Square with continuity correction (r=c=2 only) 


$$\chi^2 = \frac{N\left(|AD - BC| - \frac{N}{2}\right)^2}{(A+B)(C+D)(A+C)(B+D)}$$

where: 


A = observed value in row 1, column 1 
B = observed value in row 1, column 2 
C = observed value in row 2, column 1 
D = observed value in row 2, column 2 
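
A minimal sketch of the continuity-corrected statistic for a 2 x 2 table, using the cell labels A, B, C, D defined above. The function name `chi_square_continuity` is hypothetical.

```python
def chi_square_continuity(a, b, c, d):
    """Chi-square with continuity correction for a 2 x 2 table.

    a, b are the observed counts in row 1; c, d are those in row 2.
    """
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

stat = chi_square_continuity(10, 20, 30, 40)
```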


Mann-Whitney U 


$n_1$ = number of observations in group 1 

$n_2$ = number of observations in group 2 

$N = n_1 + n_2$ 

$n_s$ = smaller of $n_1$ and $n_2$, or $n_1$ if they are equal 
$n_l$ = larger of $n_1$ and $n_2$, or $n_2$ if they are equal 
$R_1 = \sum$ Rank of first group 

$R_2 = \sum$ Rank of second group 

$R_s = R_1$ if $n_1 \le n_2$ else $R_2$ 


$$U = n_s n_l + \frac{n_s(n_s+1)}{2} - R_s$$

$$U' = n_s n_l - U$$

$$\text{Mean} = \frac{n_1 n_2}{2}$$

$$\text{Standard Deviation} = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$$

$$z = \frac{U - \text{Mean}}{\text{Standard Deviation}}$$


Correction for Ties: 


Standard Deviation becomes 

$$\sqrt{\frac{n_1 n_2}{N(N-1)} \left(\frac{N^3 - N}{12} - \sum T\right)}$$

where $T = \frac{t^3 - t}{12}$ and t is the number of observations tied for a 
given rank. 
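
The Mann-Whitney calculation, including the tie-corrected standard deviation, might look like the sketch below. The helper `average_ranks` (tied values share the mean of their rank positions) and the function name `mann_whitney` are assumptions for illustration. With no ties the tie term is zero and the standard deviation reduces to the uncorrected form.

```python
import math
from collections import Counter

def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def mann_whitney(group1, group2):
    n1, n2 = len(group1), len(group2)
    n = n1 + n2
    pooled = list(group1) + list(group2)
    ranks = average_ranks(pooled)
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])

    # Use the smaller group (group 1 when the sizes are equal).
    ns, nl, rs = (n1, n2, r1) if n1 <= n2 else (n2, n1, r2)
    u = ns * nl + ns * (ns + 1) / 2 - rs
    u_prime = ns * nl - u

    mean = n1 * n2 / 2
    tie_sum = sum((t ** 3 - t) / 12 for t in Counter(pooled).values())
    sd = math.sqrt(n1 * n2 / (n * (n - 1)) * ((n ** 3 - n) / 12 - tie_sum))
    return u, u_prime, (u - mean) / sd
```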

Wilcoxon Signed-Rank 


D = X - Y for each matched pair 

N = number of matched pairs excluding those with a D of zero 
R = Rank of |D| 

$R^+ = \sum R$ with D > 0 

$R^- = \sum R$ with D < 0 

$T = R^+$ if $R^+ < R^-$ else $R^-$ 


$$\text{Mean} = \frac{N(N+1)}{4}$$

$$\text{Standard Deviation} = \sqrt{\frac{N(N+1)(2N+1)}{24}}$$

$$z = \frac{T - \text{Mean}}{\text{Standard Deviation}}$$


Correction for Ties: 


$$\text{Standard Deviation} = \sqrt{\frac{N(N+1)(2N+1) - \frac{1}{2}\sum T}{24}}$$


where $T = t^3 - t$ and t is the number of observations tied for a given 
rank. 
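
A sketch of the signed-rank calculation follows. The names `average_ranks` and `wilcoxon_signed_rank` are hypothetical, and the tie term is applied in the form where the sum of (t^3 - t) is halved inside the numerator over 24, which is the usual large-sample form and is an assumption about how the formula in the text should be read. Integer data are used so that tied |D| values compare exactly.

```python
import math
from collections import Counter

def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def wilcoxon_signed_rank(x, y):
    # Differences for each matched pair, excluding pairs with D = 0
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    abs_d = [abs(v) for v in d]
    ranks = average_ranks(abs_d)

    r_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    r_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    t_stat = min(r_plus, r_minus)

    mean = n * (n + 1) / 4
    tie_sum = sum(t ** 3 - t for t in Counter(abs_d).values())
    sd = math.sqrt((n * (n + 1) * (2 * n + 1) - tie_sum / 2) / 24)
    return t_stat, (t_stat - mean) / sd
```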


Spearman Rank Correlation Coefficient 


N = number of matched pairs 

$R_x$ = Rank of $X_i$ 

$R_y$ = Rank of $Y_i$ 

$D = R_x - R_y$ for each matched pair 

$$\text{Rho}(\rho) = 1 - \frac{6\sum D^2}{N(N^2-1)}$$

Correction for Ties: 


Rho ($\rho$) becomes 

$$\rho = \frac{\sum x^2 + \sum y^2 - \sum D^2}{2\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}}$$

$$\sum x^2 = \frac{N^3 - N}{12} - \sum T_x$$

$$\sum y^2 = \frac{N^3 - N}{12} - \sum T_y$$

$T_x = \frac{t^3 - t}{12}$, where t is the number of X observations tied for a given rank. 

$T_y = \frac{t^3 - t}{12}$, where t is the number of Y observations tied for a given rank. 
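
The tie-corrected form of rho can be sketched as below. The names `average_ranks` and `spearman_rho` are hypothetical. When there are no ties both tie sums are zero and the expression reduces to 1 - 6 * sum(D^2) / (N (N^2 - 1)).

```python
import math
from collections import Counter

def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def spearman_rho(x, y):
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))

    # Tie terms: sum of (t^3 - t) / 12 over each group of tied values
    tx = sum((t ** 3 - t) / 12 for t in Counter(x).values())
    ty = sum((t ** 3 - t) / 12 for t in Counter(y).values())
    sum_x2 = (n ** 3 - n) / 12 - tx
    sum_y2 = (n ** 3 - n) / 12 - ty

    return (sum_x2 + sum_y2 - sum_d2) / (2 * math.sqrt(sum_x2 * sum_y2))
```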





Kendall Correlation Coefficient 


N = number of matched pairs 
C = Kendall Statistic determined as follows: 


Rank the observations on the X variable from 1 to N. Rank the 
observations on the Y variable from 1 to N. Arrange the list of N 
subjects so that the X ranks of the subjects are in their natural order, 
i.e. 1, 2, 3,...N. For each Y rank, count the number of ranks below it 
which are larger. Then subtract the number of ranks below it which 
are smaller. The sum of this for each Y is C. 


$$\tau = \frac{C}{\frac{1}{2}N(N-1)}$$

$$\text{Standard Deviation} = \sqrt{\frac{2(2N+5)}{9N(N-1)}}$$

$$z = \frac{\tau}{\text{Standard Deviation}}$$


Correction for Ties: 

$\tau$ becomes 

$$\tau = \frac{C}{\sqrt{\frac{1}{2}N(N-1) - T_x}\,\sqrt{\frac{1}{2}N(N-1) - T_y}}$$

$T_x = \frac{1}{2}\sum (t^2 - t)$, where t is the number of X observations tied for a given rank 

$T_y = \frac{1}{2}\sum (t^2 - t)$, where t is the number of Y observations tied for a given rank 
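
The counting procedure for C can be written out directly. This sketch assumes there are no tied values in X or Y, so it implements only the uncorrected tau; the function name `kendall_tau` is hypothetical. Raw values are compared instead of ranks, which gives the same ordering.

```python
import math

def kendall_tau(x, y):
    """Return (C, tau, z) for untied data, following the counting rule above."""
    n = len(x)
    # Arrange the pairs so that X is in its natural (ascending) order.
    order = sorted(range(n), key=lambda i: x[i])
    ys = [y[i] for i in order]

    # For each Y, count later values that are larger minus those that are smaller.
    c = 0
    for i in range(n):
        for j in range(i + 1, n):
            c += (ys[j] > ys[i]) - (ys[j] < ys[i])

    tau = c / (n * (n - 1) / 2)
    sd = math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
    return c, tau, tau / sd
```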


Kolmogorov-Smirnov 


See Siegel pp. 127-136 and Hollander. 


Wald-Wolfowitz Runs Test 


$n_1$ = number of group 1 observations. 

$n_2$ = number of group 2 observations. 

R = # of Runs. A run is any sequence of scores from the same 
column. 


$$\text{Mean} = \frac{2 n_1 n_2}{n_1 + n_2} + 1$$

$$\text{Std. Deviation} = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}}$$

$$z = \frac{|R - \text{Mean}| - 0.5}{\text{Std. Deviation}}$$


Note that there is no correction for ties. Ties may invalidate the 
results. 
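
A sketch of the runs count and its large-sample z value is given below. The function name `runs_test` is hypothetical, and the standard deviation and the 0.5 continuity term follow the usual large-sample form (as in Siegel); treating that as the form used here is an assumption. Values tied across the two groups are ordered arbitrarily by this sketch, which is one reason ties may invalidate the result.

```python
import math

def runs_test(group1, group2):
    n1, n2 = len(group1), len(group2)
    # Pool the scores, sort them, and keep only the group labels.
    labeled = sorted([(v, 1) for v in group1] + [(v, 2) for v in group2])
    labels = [g for _, g in labeled]

    # A new run starts wherever the group label changes.
    runs = 1 + sum(a != b for a, b in zip(labels, labels[1:]))

    mean = 2 * n1 * n2 / (n1 + n2) + 1
    sd = math.sqrt(2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                   / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (abs(runs - mean) - 0.5) / sd
    return runs, mean, sd, z
```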


Kruskal-Wallis Test 


k = number of groups 

$n_j$ = number of cases in the j-th group 
$N = \sum n_j$, the number of cases in all groups combined 
$R_j$ = sum of ranks in the j-th group 


$$H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1)$$


Correction for Ties: 


H becomes 

$$H = \frac{\frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1)}{1 - \frac{\sum T}{N^3 - N}}$$


where: 


$T = t^3 - t$ (where t is the number of tied observations in a tied group of 
scores) and $\sum T$ directs one to sum over all groups of ties. 
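
The H statistic with its tie correction can be sketched as follows. The names `average_ranks` and `kruskal_wallis` are hypothetical, and the divisor 1 - sum(T) / (N^3 - N) is the customary tie correction, assumed here to be the one intended. Without ties the divisor is 1 and H is unchanged.

```python
from collections import Counter

def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def kruskal_wallis(groups):
    """groups: list of lists of scores, one list per group."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of R_j^2 / n_j over the groups
    total, start = 0.0, 0
    for g in groups:
        r_j = sum(ranks[start:start + len(g)])
        total += r_j ** 2 / len(g)
        start += len(g)

    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    tie_sum = sum(t ** 3 - t for t in Counter(pooled).values())
    return h / (1 - tie_sum / (n ** 3 - n))
```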


Friedman Test 


k = number of X columns 

N = number of rows 

$R_j = \sum R$ for each column where R is the score ranked by row, 
j = 1...k 


$$\chi_r^2 = \frac{12}{Nk(k+1)} \sum_{j=1}^{k} R_j^2 - 3N(k+1)$$
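
The uncorrected Friedman statistic can be sketched by ranking each row and summing the ranks down each column. The names `average_ranks` and `friedman` are hypothetical, and no tie correction is applied in this sketch.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def friedman(rows):
    """rows: N rows, each holding one score per X column (k columns)."""
    n, k = len(rows), len(rows[0])
    col_sums = [0.0] * k
    for row in rows:
        # Rank the scores within the row, then add them to the column sums.
        for j, r in enumerate(average_ranks(row)):
            col_sums[j] += r
    return 12 / (n * k * (k + 1)) * sum(r * r for r in col_sums) - 3 * n * (k + 1)
```

When every row orders the columns identically the statistic takes its largest value for that N and k; when the column rank sums are all equal it is zero.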


Correction for Ties: 


y2_ 122(R-NR)* ST 
K 
Index 


$ sign 44 

4th DIMENSION 46 

50th percentile 125 

68881 3 

Activating windows 24 
Adding columns 47, 241 
Align to Grid 110 
Alpha-numeric data 44 
Altering datasets 47 

Analysis on data in groups 58 
ANOVA 211 

Arrow Heads 112 

ASCII text files 6 

Assigning column types 25 
Assigning variables 14, 49 
Axes 113 

Axis bounds 113 

Balanced 211 

Bar Chart 52, 61, 74, 77, 85, 86, 87, 88, 153 
Bar control 89, 103 

Bigger Points 95 

Box Plot 52, 61, 90, 136, 148 
Category data 26 

Category Name 27 
Cellulation 96 

Center Justify 111 

Changing column data type 29, 48 
Changing column names 48 
Changing window size 25 
Chi-Square 225 

Chi-Square Goodness of Fit 225 
Chi-Square with Continuity Correction 225 
Choose X 49, 51 

Choose Y 49, 51 

Choosing a Statistic 11 

Clear 40, 108 

Clear Ranges 68 

Clear X&Y 49, 51 

Close box 22 

Coefficient of variation 128 
Color 2, 4, 104, 105, 111, 112 
Color Picker 106 

Column names 43 

Column types 44 


Command-period 36 
Comparative bar chart 88 
Compare menu 54 

Compare Percentiles 82, 159 
Comparison percentile chart 82 
Composite mode 92 

Composite view 62 
Composite/paging control 92 
Confidence bands control 80, 100, 176 
Confidence Intervals 87, 88, 89, 132, 143 
Connected error bars 90 
Contingency Coefficient 225 
Contingency table 225, 227 
Continuous data 246 

Copy 36, 108 

Copy View 108 

Copy View command 62 
Copying data 36 

Copying graphs 108 

Copying values from table view 62, 108 
Correlation 167 

Correlation coefficient 166, 168 
Correlation matrix 169 
Cramer's V 225 

Creating a graph 14 

Creating datasets 12, 25 
Cumulative frequency curve 137 
Cursor movement 33 

Custom Rulers 109 
Customizations 115 

Cut 40, 108 

Data window 22 

Dataset structure 24 

Decimal places 63 

Delete 41 

Describe menu 54, 123 
Difference 245 

Distinguish Variables By 104 
Dividing columns 241 
Drawing 73 

Drawing objects 111 

Drawing tools 73 

Dunnett t-test 215, 222 

Edit Categories 29 
Edit menu 62 

Edit Palette 106 

Edit Range 67 

Eigenvectors 203 

Element Name 27 

Enter key 34 

Entering and editing data 31 

Entering data 13 

Equal axes control 83, 101 

Error bars 89, 131, 134, 144 

Error bars control 98 

Error bars for scattergrams 99 

Excel 46 

Exclude Row 64 

Expected values 226 

Factor analysis 194 

Factorial model 211 

Fill 112 

Final communality estimate 204 

Fisher's PLSD test 215, 222 

Fitted Values 172, 188 

Font 111 

Format 48, 63 

Formula 241 

Formulae 261 

Frame control 94 

Frequency Distribution 88, 151, 153, 154, 155, 
156 


Friedman Test 239 
Full page views 71 

G Statistic 225 
Geometric Mean 125 
Graphing data 52 

Grid 109 

Grid Lines 114 
Grouping columns 58 
Grow box 22 
Harmonic Mean 125, 130 
Harris image analysis 196 
Heywood case 204 
Hide Legend 108, 115 
Histogram 86 
ImageWriter 5 

Import 42 

Import algorithm 43 
Import Example 45 
Importing data 41 
Include Row 64 
Individual X 52 

Input row 31 

Inserting columns 47 
Installation 9 

Integer data 26 
Interactive analysis 64 
Inverting matrices 60 

Iterated principal axis 196 

Kaiser image analysis 196 

Kendall rank correlation coefficient 234 

Kolmogorov-Smirnov 236 

Kruskal-Wallis 237 

Kurtosis 127 

Kurtosis & Skewness 125 

Lag 245 

Large screen 3 

LaserWriter 5 

Layers 102 

Left Justify 111 

Legend 114 

Leptokurtic 127 

Library 27 

Line Chart 52, 61, 74, 77, 82, 83, 84, 89 

Line Width 112 

Lipid Data 6 

Lock bounds 113 

Long integer data 26 

MacDraw 6 

MacWrite 6 

Mann-Whitney U 231 

ManyX statistics 54 

ManyXManyY statistics 56 

ManyXOneY statistics 55 

Matrix inversion 60 

Maximum 126 

Mean 76, 125 

Mean, Std. Dev., etc. 87, 88, 89, 124, 125, 131, 
143 

Median 125 

Memory alerts 258 

Memory limits 257 

Mesokurtic 127 

Microsoft Word 6 

Minimum 126 

Missing cells 211 

Missing values 33, 58, 249 

Mode 125 

Model II estimate of between component variance 

Monochrome monitor 105 

Moving average 245 

MultiFinder 36 

Multiple analyses 56 

Multiple comparison tests 212 

Multiple regression 171, 179 

Multiple windows 23 

Multiplying columns 241 

Naming columns 25 

Negatively-skewed distribution 140 

New 25, 63 


New Column 47 

New Data Column Information 25 
No symbols control 85, 102 
Non-color systems 2 
Nonparametrics 230 
Normal distribution 139 
Normal format 69 
Normalized Random 253 
Notch control 91, 101 
Notched box plot 137 
Numeric co-processor 3 
Numeric data 36 

Old Quickdraw 2 

Omnis 3 Plus 46 

One group t-Test 162 

OneX statistics 54 
OneXOneY statistics 55 
Open Axis 108, 113 
Opening datasets 10, 22 
Ordinal numbers 28 

Outlier control 92, 100 
Overlap 107 

Overview 9 

Page Footer 71 

Page Title 71 

PageMaker 6 

Paging mode 92 

Paging view 62 

Paired two group t-Test 163 
Palette 73, 106 

Parameter controls 93 

Paste 36, 108 

Paste Transposed 40 
Pasting 36 

Pen 112 

Percentages 245 

Percentile 90 

Percentile control 78, 101 
Percentile plot 77 
Percentiles 77, 125, 129, 136, 137, 148 
Phi 225 

PICT format 62, 108 

Pie Chart 52, 61, 88, 156 
PixelPaint 6 

Platykurtic 127, 140 

Plotter 5 

Point overlap control 94 
Point Size 112 

Point Type 104, 112 
Polynomial regression 79, 81, 181 
Positively-skewed distribution 140 
Post hoc cell contributions 226 
Predicted Values 172, 188 
Preferences 34, 63, 104 


Primary pattern solution 206 
Print scaling 71 

Printer 5 

Printing 71 

Quick Assignment 49, 50 
Quick start 9 

QuickDraw colors 105 
Range 126 

Range of values 251 

Range restrictions 65 

Rank 245 

Read only files 69 

Real data 26 

Recode 246 
Recommendations 4 
Rectangle tool 111 
Reference structure solution 206 
References 271 
Regression 79, 171 
Removing frames 71 
Repeated measures model 211 
Requirements 3 

Residuals 172, 188 
Resizing chart 103 

Resizing dataset columns 24 
Resizing legend 114 
Resizing objects 107, 112 
Resizing text 110 

Resizing view window 103 
Resolution 96 

Right Justify 111 

Rotate Left 111 

Rotate Right 111 

Row percents 226 

Rulers 109 

Running analyses 53 
Running sum 245 

Save 69 

Scattergram 52, 61, 74, 77, 79, 82, 89, 137, 209 
Scheffé F-test 215, 222 
Scroll bars 22, 62 

Select All 107 

Select All Columns 35 
Select All Rows 36, 64 
Select Background 107 
Select Range 64 

Selecting 106 

Selecting data 35 

Selecting variables 52 
Selecting views 52 
Separator characters 41, 43 
Series 252 

Show Legend 115 

Show Ruler 109 
Simple linear regression 79, 171 
Simple regression 80, 173 
Size 111 

Skewness 127 

Slide 5 

Small screen 3 

SMC 204 

Sort 251 

Spearman rank correlation coefficient 233 
Split Columns 142, 253 
Squared multiple correlations 204 
Standard deviation 76, 126 
Standard score 245 
Standardized Residuals 172, 188 
StatView 5 

StatView 512+ 5 

Std. Error 129 

Stepwise regression 187 
String data 26 

Style 111 

Subscripting 51 

Subset specify control 98 
Sum of squares 59 
Sunflowers 95 

t-Test 161 

Table 52, 61, 125 

Tabulated Data 227 

Text 110 

Text file format 41, 70 

Tick Marks 114 

Time Series 253 

Toggles 93 

Tools menu 241 

Tools palette 73, 74, 93 
Transform 244 
Transformation method 198 
Transforming column 17 
Transposed data 39 

Turn Grid On 110 

Two group t-Test 163, 165 
Unbalanced 219 
Unbalanced data 212 
Uniform Random 253 
Univariate plots 74 
Univariate scattergrams 131 
Unpaired two group t-Test 165 
Variable complexity 207 
Variable subscripting 51 
Variables menu 49 
Variance 126 

View controls 73, 93 

View menu 60 

View Zoom Preference 23 
Viewing analysis results 11 
Wald-Wolfowitz Runs 237 
Wilcoxon Signed-Rank 232 
Window menu 53 

X-Y variable pairs 52 
z-score distribution 134 
z-score histogram 87, 135 
Zoom Down 23 

Zoom Up 23, 53 

Zooming views 53 


